The LOTUS initiative for open knowledge management in natural products research

  1. Adriano Rutz
  2. Maria Sorokina
  3. Jakub Galgonek
  4. Daniel Mietchen
  5. Egon Willighagen
  6. Arnaud Gaudry
  7. James G Graham
  8. Ralf Stephan
  9. Roderic Page
  10. Jiří Vondrášek
  11. Christoph Steinbeck
  12. Guido F Pauli
  13. Jean-Luc Wolfender
  14. Jonathan Bisson  Is a corresponding author
  15. Pierre-Marie Allard  Is a corresponding author
  1. School of Pharmaceutical Sciences, University of Geneva, Switzerland
  2. Institute of Pharmaceutical Sciences of Western Switzerland, University of Geneva, Switzerland
  3. Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena, Germany
  4. Institute of Organic Chemistry and Biochemistry of the CAS, Czech Republic
  5. Ronin Institute, United States
  6. Leibniz Institute of Freshwater Ecology and Inland Fisheries, Germany
  7. School of Data Science, University of Virginia, United States
  8. Department of Bioinformatics-BiGCaT, Maastricht University, Netherlands
  9. Center for Natural Product Technologies and WHO Collaborating Centre for Traditional Medicine (WHO CC/TRM), Pharmacognosy Institute; College of Pharmacy, University of Illinois at Chicago, United States
  10. Department of Pharmaceutical Sciences, College of Pharmacy, University of Illinois at Chicago, United States
  11. Ontario Institute for Cancer Research (OICR), University Ave Suite, Canada
  12. University of Glasgow, United Kingdom
  13. Department of Biology, University of Fribourg, Switzerland

Abstract

Contemporary bioinformatic and chemoinformatic capabilities hold promise to reshape knowledge management, analysis and interpretation of data in natural products research. Currently, reliance on a disparate set of non-standardized, insular, and specialized databases presents a series of challenges for data access, both within the discipline and for integration and interoperability between related fields. The fundamental elements of exchange are referenced structure-organism pairs that establish relationships between distinct molecular structures and the living organisms from which they were identified. Consolidating and sharing such information via an open platform has strong transformative potential for natural products research and beyond. This is the ultimate goal of the newly established LOTUS initiative, which has now completed the first steps toward the harmonization, curation, validation and open dissemination of 750,000+ referenced structure-organism pairs. LOTUS data is hosted on Wikidata and regularly mirrored on https://lotus.naturalproducts.net. Data sharing within the Wikidata framework broadens data access and interoperability, opening new possibilities for community curation and evolving publication models. Furthermore, embedding LOTUS data into the vast Wikidata knowledge graph will facilitate new biological and chemical insights. The LOTUS initiative represents an important advancement in the design and deployment of a comprehensive and collaborative natural products knowledge base.

Editor's evaluation

Rutz et al. describe the LOTUS initiative, an open science database that contains over 750,000 referenced structure-organism pairs. Presenting both the data they have made available in Wikidata and an interactive web portal, LOTUS provides a powerful platform for mining the literature for published data on structure-organism pairs. The strength of this initiative lies in the effort the authors have put into creating a database that is both reproducible and usable. The result is a complete and user-friendly product that will respond to the community's needs.

https://doi.org/10.7554/eLife.70780.sa0

Introduction

Evolution of electronic natural products resources

Natural Products (NP) research is a transdisciplinary field with wide-ranging interests: from fundamental structural aspects of naturally occurring molecular entities to their effects on living organisms and extending to the study of chemically mediated interactions within entire ecosystems. Defining the ‘natural’ qualifier is a complex task (Ducarme and Couvet, 2020; All natural, 2007). We thus adopt here a broad definition of a NP as any chemical entity found in a living organism, hereafter referred to as a structure-organism pair. An additional and fundamental element of a structure-organism pair is a reference to the experimental evidence that establishes the linkages between the chemical structure and the biological organism. A future-oriented electronic NP resource should contain fully-referenced structure-organism pairs.

Reliance on data from the NP literature presents many challenges. The assembly and integration of NP occurrences into an inter-operative platform relies primarily on access to a heterogeneous set of databases (DB) whose content and maintenance status are critical factors in this dependency (Tsugawa, 2018). A tertiary inter-operative NP platform is thus dependent on a secondary set of data that has been selectively annotated into a DB from primary literature sources. The experimental data itself reflects a complex process involving collection or sourcing of natural material (and establishment of its identity), a series of material transformation and separation steps and ultimately the chemical or spectral elucidation of isolates. The specter of human error and the potential for the introduction of biases are present at every phase of this journey. These include publication biases (Lee et al., 2013), such as emphasis on novel and/or bioactive structures in the review process, or, in DB assembly stages, with selective focus on a specific compound class or a given taxonomic range, or disregard for annotation of other relevant evidence that may have been presented in primary sources. Temporal biases also exist: a technological ‘state-of-the-art’ when published can eventually be recast as anachronistic.

The advancement of NP research has always relied on the development of new technologies. In the past century alone, the time required for the unambiguous identification of new NP entities from biological matrices has been reduced from years to days, and in the past few decades, the scale at which new NP discoveries are reported has increased exponentially. Without a means to access and process these disparate NP data points, information is fragmented and scientific progress is impaired (Balietti et al., 2015). To this extent, contemporary bioinformatic tools enable the (re-)interpretation and (re-)annotation of (existing) datasets documenting molecular aspects of biodiversity (Mongia and Mohimani, 2021; Jarmusch et al., 2020).

While large, well-structured and freely accessible DBs exist, they are often concerned primarily with chemical structures (e.g. PubChem (Kim et al., 2019), with over 100 M entries) or biological organisms (e.g. GBIF (GBIF, 2020), with over 1900 M entries), but scarce interlinkages limit their application for the documentation of NP occurrence(s). Currently, no open, cross-kingdom, comprehensive and computer-interpretable electronic NP resource links NP and their containing organisms, along with referral to the underlying experimental work. This shortcoming breaks the crucial evidentiary link required for tracing information back to the original data and assessing its quality. Even valuable commercially available efforts for compiling NP data, such as the Dictionary of Natural Products (DNP), can lack proper documentation of these critical links.

Pioneering efforts to address such challenges led to the establishment of KNApSAck (Shinbo et al., 2006), which is likely the first public, curated electronic NP resource of referenced structure-organism pairs. KNApSAck (Afendi et al., 2012) currently contains 50,000+ structures and 100,000+ structure-organism pairs. However, the organism field is not standardized and access to the data is not straightforward. Another early-established electronic NP resource is the NAPRALERT dataset (Graham and Farnsworth, 2010), which was compiled over five decades from the NP literature, gathering and annotating data derived from over 200,000 primary literature sources. This dataset contains 200,000+ distinct compound names and structural elements, along with 500,000+ records of distinct, fully-cited structure-organism pairs. In total, NAPRALERT contains over 900,000 such records, due to equivalent structure-organism pairs reported in different citations. However, NAPRALERT is not an open platform and employs an access model that provides only limited free searches of the dataset. Finally, the NPAtlas (van Santen et al., 2019; van Santen et al., 2022) is a more recent project complying with the FAIR (Findability, Accessibility, Interoperability, and Reuse) guidelines for digital assets (Wilkinson et al., 2016) and offering convenient web access. While the NPAtlas allows retrieval and encourages submission of compounds with their biological source, it focuses on microbial NP and ignores a wide range of biosynthetically active organisms found in the Plantae kingdom.

The LOTUS initiative seeks to address the aforementioned shortcomings. Building on the experience gained through the establishment of the recently published COlleCtion of Open NatUral producTs (COCONUT) (Sorokina et al., 2021a) regarding the aggregation and curation of NP structural databases, this savoir-faire was expanded to bring biological organisms and scientific references into the equation. After extensive data curation and harmonization of over 40 electronic resources, pairs characterizing a NP occurrence were standardized at the chemical, biological and reference levels. At its current stage of development, LOTUS disseminates 750,000+ referenced structure-organism pairs. These efforts and experiences represent an intensive preliminary curatorial phase and the first major step towards providing a high-quality, computer-interpretable knowledge base capable of transforming NP research data management from a classical (siloed) database approach to an optimally shared resource.

Accommodating principles of FAIRness and TRUSTworthiness for natural products knowledge management

In awareness of the multi-faceted pitfalls associated with implementing, using and maintaining classical scientific DBs (Helmy et al., 2016), and to enhance current and future sharing options, the LOTUS initiative selected the Wikidata platform for disseminating its resources. The idea of using wikis to disseminate databases is not new and has multiple underlying advantages (Finn et al., 2012). Since its creation, Wikidata has focused on cross-disciplinary and multilingual support. Wikidata is curated and governed collaboratively by a global community of volunteers, about 20,000 of whom contribute monthly. Wikidata currently contains more than 1 billion statements in the form of subject-predicate-object triples. Triples are machine-interpretable and can be enriched with qualifiers and references. Within Wikidata, data triples correspond to approximately 100 million entries, which can be grouped into classes as diverse as countries, songs, disasters, or chemical compounds. The statements are closely integrated with Wikipedia and serve as the source for many of its infoboxes. Various workflows have been established for reporting such classes, particularly those of interest to the life sciences, such as genes, proteins, diseases, drugs, or biological taxa (Waagmeester et al., 2020).

Building on the principles and experiences described above, the present report introduces the development and implementation of the LOTUS workflow for NP occurrence curation and dissemination, which applies both FAIR and TRUST (Transparency, Responsibility, User focus, Sustainability and Technology) principles (Lin et al., 2020). LOTUS data upload and retrieval procedures ensure optimal accessibility by the research community, allowing any researcher to contribute, edit and reuse the data with a clear and open CC0 license (Creative Commons 0).

Despite many advantages, Wikidata hosting has some notable, yet manageable drawbacks. While its SPARQL query language offers a powerful way to query available data, it can also appear intimidating to less experienced users. Furthermore, some typical queries of molecular electronic NP resources, such as structural or spectral searches, are not yet available in Wikidata. To address these shortcomings, LOTUS is hosted in parallel at https://lotus.naturalproducts.net (LNPN) within the naturalproducts.net ecosystem. The Natural Products Online website is a portal for open-source and open-data resources for NP research. In addition to the generalist COCONUT and LNPN databases, the portal will enable the hosting of arbitrary, skinned collections, themed in particular by species or taxonomic clade, by geographic location or by institution, together with a range of cheminformatics tools for NP research. LNPN is periodically updated with the latest LOTUS data. This dual hosting provides an integrated, community-curated and vast knowledge base (via Wikidata), as well as a NP community-oriented product with tailored search modes (via LNPN). The multiple data interaction options should establish the basis for transparent and sustainable access to, sharing of, and creation of knowledge on NP occurrence.

The LOTUS initiative was initially launched to address our need to access the most comprehensive compilation of biological occurrences of NP. Indeed, we recently highlighted the value of considering the taxonomic dimension when annotating metabolites (Rutz et al., 2019). That said, many other concrete applications can result from access by the scientific community to the LOTUS initiative data. For example, such a resource will facilitate the exploration of eco-evolutionary mechanisms at the molecular level (Defossez et al., 2021). In terms of drug discovery, this resource is extremely valuable for orienting and guiding researchers toward structures of interest. In the same vein, LOTUS is expected to be the perfect place to find ‘molecular arguments’ for biodiversity conservation (Campbell, 2003). Researchers interested in the history of science will be able, through this kind of resource, to gain a preliminary view of the temporal evolution of disciplines such as pharmacognosy. More generally, the objective of the LOTUS initiative is to prepare the ground for an electronic and globally accessible resource that would be the counterpart, at the metabolite level, of established databases linking proteins to biological organisms (e.g. UniProt) and genes to biological organisms (e.g. GenBank). Once such an objective is reached, it will be possible to interconnect the three central objects of life, that is metabolites, proteins and genes, through the entity common to these resources, the biological organism. Such an interconnection, fostering cross-fertilization between chemistry, biology and associated disciplines, is desirable and necessary to advance towards a better understanding of Life.

Results and discussion

This section is structured as follows: first, we present an overview of the LOTUS initiative at its current stage of development. The central curation and dissemination elements of the LOTUS initiative are then explained in detail. The third section addresses the interaction modes between end-users and LOTUS, including data retrieval, addition, and editing. Some examples of how LOTUS data can be used to answer research questions or develop hypotheses are given. The final section is dedicated to the interpretation of LOTUS data and illustrates the dimensions and qualities of the current LOTUS dataset from chemical and biological perspectives.

Blueprint of the LOTUS initiative

Building on the standards established by the related WikiProjects on Wikidata (Chemistry, Taxonomy and Source Metadata), a NP chemistry-oriented subproject was created (Chemistry/Natural products). Its central data consists of three minimal sufficient objects:

  • A chemical structure object, with associated Simplified Molecular Input Line Entry System (SMILES) (Weininger, 1988), International Chemical Identifier (InChI) (Heller et al., 2013) and InChIKey (a hashed version of the InChI).

  • A biological organism object, with associated taxon name, the taxonomic DB where it was described and the taxon ID in the respective DB.

  • A reference object describing the structure-organism pair, with the associated article title and a Digital Object Identifier (DOI), PubMed ID (PMID), or PubMed Central ID (PMCID).

As data formats are largely inhomogeneous among existing electronic NP resources, fields related to chemical structures, biological organisms and references are variable and essentially not standardized. Therefore, LOTUS implements multiple stages of harmonization, processing, and validation (Figure 1, stages 1–3). LOTUS employs a Single Source of Truth (SSOT, Single_source_of_truth) to ensure data reliability and the continuous availability of the latest curated version of LOTUS data in both Wikidata and LNPN (Figure 1, stage 4). The SSOT approach consists of a PostgreSQL DB that structures links and data schemes such that every data element has a single place. The LOTUS processing pipeline is tailored to efficiently include and diffuse novel or curated data directly from new sources or at the Wikidata level. This iterative workflow relies both on data addition and retrieval actions as described in the Data Interaction section. The overall process leading to referenced and curated structure-organism pairs is illustrated in Figure 1 and detailed hereafter.
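The three minimal objects and their linkage can be sketched as plain data records. This is an illustrative sketch only: the class and field names below are our own choices and do not reflect the actual SSOT PostgreSQL schema.

```python
from dataclasses import dataclass

# Field names are illustrative; the real SSOT is a PostgreSQL schema.
@dataclass(frozen=True)
class ChemicalStructure:
    smiles: str    # Simplified Molecular Input Line Entry System
    inchi: str     # International Chemical Identifier
    inchikey: str  # hashed InChI, convenient as a structure key

@dataclass(frozen=True)
class BiologicalOrganism:
    taxon_name: str   # accepted canonical name
    taxonomy_db: str  # taxonomic DB where the taxon is described
    taxon_id: str     # taxon ID in that DB

@dataclass(frozen=True)
class Reference:
    title: str
    identifier: str   # DOI, PMID, or PMCID

@dataclass(frozen=True)
class ReferencedPair:
    """One referenced structure-organism pair."""
    structure: ChemicalStructure
    organism: BiologicalOrganism
    reference: Reference
```

A curated entry such as the one later shown in Table 1 would then map to a single `ReferencedPair` instance.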

Blueprint of the LOTUS initiative.

Data undergo a four-stage process: (1) Harmonization, (2) Processing, (3) Validation, and (4) Dissemination. The process was designed to incorporate future contributions (5), either by the addition of new data from within Wikidata (a) or new sources (b) or via curation of existing data (c). The figure is available under the CC0 license at https://commons.wikimedia.org/wiki/File:Lotus_initiative_1_blueprint.svg.

By design, this iterative process fosters community participation, essential to efficiently document NP occurrences. All stages of the workflow are described on the git sites of the LOTUS initiative at https://github.com/lotusnprod and in the methods. At the time of writing, 750,000+ LOTUS entries contained a curated chemical structure, biological organism and reference and were available on both Wikidata and LNPN. As the LOTUS data volume is expected to increase over time, a frozen (as of 2021-12-20) tabular version of this dataset with its associated metadata is made available at https://doi.org/10.5281/zenodo.5794106 (Rutz et al., 2021a).

Data harmonization

Multiple data sources were processed as described hereafter. All publicly accessible electronic NP resources included in COCONUT that contain referenced structure-organism pairs were considered as initial input. The data were complemented with COCONUT’s own referenced structure-organism pairs (Sorokina and Steinbeck, 2020a), as well as the following additional electronic NP resources: Dr. Duke (U.S. Department of Agriculture, 1992), Cyanometdb (Jones et al., 2021), Datawarrior (Sander et al., 2015), a subset of NAPRALERT, Wakankensaku (Wakankenaku, 2020) and DiaNat-DB (Madariaga-Mazón et al., 2021).

The maintainers of the electronic NP resources not explicitly licensed as open were individually contacted for permission to access and reuse their data. A detailed list of data sources and related information is available as Appendix 1. All necessary scripts for data gathering and harmonization can be found in the lotus-processor repository in the src/1_gathering directory, and the process is detailed in the corresponding gathering methods section. All subsequent iterations including new data sources, whether updated information from the same sources or new data, will involve a comparison with the previously gathered data at the SSOT level to ensure that the data is only curated once.

Data processing and validation

As shown in Figure 1, data curation consisted of three stages: harmonization, processing, and validation. After the harmonization stage, each of the three central objects – chemical structures, biological organisms, and references – was processed, as described in the related methods section. Given the data size (2.5M+ initial entries), manual validation was unfeasible. Curating the references was a particularly challenging part of the process. Whereas organisms are typically reported by at least their vernacular or scientific denomination, and chemical structures via their SMILES, InChI, InChIKey or image (not covered in this work), references suffer from largely insufficient reporting standards. Despite poor standardization of the initial reference field, proper referencing remains an indispensable way to establish the validity of structure-organism pairs. Better reporting practices, supported by tools such as Scholia (Blomqvist et al., 2017; Rasberry et al., 2019) and relying on Wikidata, Fatcat, or Semantic Scholar, should improve reference-related information retrieval in the future.

In addition to curating the entries during data processing, 420 referenced structure-organism pairs were selected for manual validation. An entry was considered valid if: (i) the structure (in the form of any structural descriptor that could be linked to the final sanitized InChIKey) was described in the reference; (ii) the containing organism (as any organism descriptor that could be linked to the accepted canonical name) was described in the reference; and (iii) the reference described the occurrence of the chemical structure in the biological organism. More details are available in the related methods section. This process allowed us to establish rules for automatic filtering and validation of the entries. The parameters of the automatic filtering are available as a function (filter_dirty.R) and are further described in the related methods section. The automatic filtering was then applied to all entries. To confirm the efficacy of the filtering process, a new subset of 100 diverse, automatically curated and automatically validated entries was manually checked, yielding a true positive rate of 97%. The detailed results of the two manual validation steps are reported in Appendix 2. The resulting data is also available in the dataset shared at https://doi.org/10.5281/zenodo.5794106 (Rutz et al., 2021a). Table 1 shows an example of a referenced structure-organism pair before and after curation. This process resolved the structure to an InChIKey, the organism to a valid taxonomic name and the reference to a DOI, thereby completing the establishment of the essential referenced structure-organism pair.
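In the simplest terms, the three validity criteria reduce to a predicate over a curated entry. The sketch below is a deliberate simplification: the dictionary keys are hypothetical field names chosen for illustration, and the real pipeline additionally checks that each descriptor is actually described in the cited reference, which goes well beyond these field checks.

```python
def is_valid_entry(entry: dict) -> bool:
    """Apply a simplified version of the three validity criteria."""
    # (i) a structural descriptor resolvable to a sanitized InChIKey
    has_structure = bool(entry.get("inchikey"))
    # (ii) an organism descriptor resolvable to an accepted canonical name
    has_organism = bool(entry.get("canonical_name"))
    # (iii) a reference resolvable to a DOI, PMID, or PMCID
    has_reference = any(entry.get(key) for key in ("doi", "pmid", "pmcid"))
    return has_structure and has_organism and has_reference
```

Only entries satisfying all three conditions would proceed toward dissemination; anything else is set aside for later inspection.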

Table 1
Example of a referenced structure-organism pair before and after curation.
Structure | Organism | Reference
Before curation: Cyathocaline | Stem bark of Cyathocalyx zeylanica CHAMP. ex HOOK. f. & THOMS. (Annonaceae) | Wijeratne E. M. K., de Silva L. B., Kikuchi T., Tezuka Y., Gunatilaka A. A. L., Kingston D. G. I., J. Nat. Prod., 58, 459–462 (1995).
After curation: VFIIVOHWCNHINZ-UHFFFAOYSA-N | Cyathocalyx zeylanicus | 10.1021/np50117a020

Challenging examples encountered during the development of the curation process were compiled in an edge cases table (tests/tests.tsv) to allow for automated unit testing. These tests allow a continuous revalidation of any change made to the code, ensuring that corrected errors will not reappear. The alluvial plot in Figure 2 illustrates the individual contribution of each source and original subcategory that led to the processed categories: structure, organism, and reference.
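The edge-case mechanism can be mimicked in miniature: a tab-separated table of raw values and their expected curated forms, replayed against the curation code after every change. Both the table content and the toy `curate_doi` normalizer below are invented for illustration and are far simpler than the actual rules in tests/tests.tsv.

```python
import csv
import io

# Miniature stand-in for an edge-case table (content invented for illustration).
EDGE_CASES_TSV = "raw\texpected\n10.1021 /NP50117A020\t10.1021/np50117a020\n"

def curate_doi(raw: str) -> str:
    """Toy DOI normalizer: drop internal whitespace, lowercase the suffix."""
    prefix, _, suffix = raw.replace(" ", "").partition("/")
    return f"{prefix}/{suffix.lower()}"

def run_edge_case_tests() -> bool:
    """Replay every recorded edge case; any regression makes this return False."""
    reader = csv.DictReader(io.StringIO(EDGE_CASES_TSV), delimiter="\t")
    return all(curate_doi(row["raw"]) == row["expected"] for row in reader)
```

Wiring such a replay into continuous integration is what guarantees that a corrected error stays corrected.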

Alluvial plot of the data transformation flow within LOTUS during the automated curation and validation processes.

The figure also reflects the relative proportions of the data stream in terms of the contributions from the various sources (‘source’ block, left), the composition of the harmonized subcategories (‘original subcategory’ block, middle) and the validated data after curation (‘processed category’ block, right). Automatically validated entries are represented in green, rejected entries in blue. The figure is available under the CC0 license at https://commons.wikimedia.org/wiki/File:Lotus_initiative_1_alluvial_plot.svg.

The figure highlights, for example, the essential contribution of the reference DOI category to the final validated entries. A similar pattern can be seen for structures, where the validation rate of structural identifiers is higher than that of chemical names. The combination of the results of the automated curation pipeline and the manually curated entries led to the establishment of four categories (manually validated, manually rejected, automatically validated and automatically rejected) of referenced structure-organism pairs that formed the processed part of the SSOT. Out of a total of 2.5M+ initial pairs, the manual and automatic validation retained 750,000+ pairs (approximately 30%), which were then selected for dissemination on Wikidata. Many validated entries were redundant across the source databases, which also explains the decrease between the initial and validated pairs. Moreover, because data quality was favored over quantity, the number of rejected entries is high. Among them, some correct entries were certainly falsely rejected and thus not disseminated. All rejected entries were kept aside for later manual inspection and validation. They are publicly available at https://doi.org/10.5281/zenodo.5794597 (Rutz et al., 2021b). In the end, the disseminated data contained 290,000+ unique chemical structures, 40,000+ distinct organisms and 75,000+ references.

Data dissemination

Research worldwide can benefit the most when all results of published scientific studies are fully accessible immediately upon publication (Agosti and Johnson, 2002). This concept is considered the foundation of scientific investigation and a prerequisite for the effective direction of new research efforts based on prior information. To achieve this, research results have to be made publicly available and reusable. As computers are now the main investigation tool for a growing number of scientists, all research data including those in publications should be disseminated in computer-readable format, following the FAIR principles. LOTUS uses Wikidata as a repository for referenced structure-organism pairs, as this allows documented research data to be integrated with a large, pre-existing and extensible body of chemical and biological knowledge. The dynamic nature of Wikidata encourages the continuous curation of deposited data by the user community. Independence from individual and institutional funding represents another major advantage of Wikidata. The Wikidata knowledge base and the option to use elaborate SPARQL queries allow the exploration of the dataset from a virtually unlimited number of angles. The openness of Wikidata also offers unprecedented opportunities for community curation, which will support, if not guarantee, a dynamic and evolving data repository. At the same time, certain limitations of this approach can be anticipated. Despite (or possibly due to) their power, SPARQL queries can be complex and potentially require an in-depth understanding of the models and data structure. This involves a steep learning curve which can discourage some end-users. Furthermore, traditional ways to query electronic NP resources such as structural or spectral searches are currently not within the scope of Wikidata and are thus addressed in LNPN.
Using the pre-existing COCONUT template, LNPN hosting allows the user to perform structural searches by directly drawing a molecule, thereby addressing the current lack of such structural search possibilities in Wikidata. Since metabolite profiling by Liquid Chromatography-Mass Spectrometry (LC-MS) is now routinely used for the chemical composition assessment of natural extracts, future versions of LOTUS and COCONUT are envisioned to be augmented by predicted MS spectra and hosted at https://naturalproducts.net to allow mass- and spectral-based queries. Note that such a spectral database is already available at https://doi.org/10.5281/zenodo.5607264 (Allard et al., 2021). To facilitate queries focused on specific taxa (e.g. ‘return all molecules found in the Asteraceae family’), a unified taxonomy is paramount. As the taxonomy of living organisms is a complex and constantly evolving field, all taxon identifiers from all accepted taxonomic DBs were kept for a given taxon name. Initiatives such as the Open Tree of Life (OTL) (Rees and Cranston, 2017) will help to gradually reduce these discrepancies, and the Wikidata platform can and does support such developments. OTL also benefits from regular expert curation and new data. As a taxonomic identifier property for this resource did not exist in Wikidata, its creation was requested and obtained. The property is now available as ‘Open Tree of Life ID’ (P9157).

Following the previously described curation process, all validated entries have been made available through Wikidata and LNPN. LNPN regularly mirrors Wikidata LOTUS data through the SSOT, as described in Figure 1.

User interaction with LOTUS data

The possibilities to interact with the LOTUS data are numerous. The following gives examples of how to retrieve, add and edit LOTUS data.

Data retrieval

LOTUS data can be queried and retrieved either directly in Wikidata or on LNPN, both of which have distinct advantages. While Wikidata offers flexible and powerful query capabilities at the cost of potential complexity, LNPN provides a graphical user interface for drawing chemical structures, simplified structural or biological filtering and advanced chemical descriptors, albeit with a more rigid structure. For bulk download, a frozen version of LOTUS data (timestamp of 2021-12-20) is also available at https://doi.org/10.5281/zenodo.5794106 (Rutz et al., 2021a). More refined approaches to the direct interrogation of the up-to-date LOTUS data both in Wikidata and LNPN are detailed hereafter.

Wikidata

The easiest way to search for NP occurrence information in Wikidata is by typing the name of a chemical structure directly into the ‘Search Wikidata’ field, which (for left-to-right languages) can be found in the upper right corner of the Wikidata homepage or any other Wikidata page. For example, by typing ‘erysodine’, the user will land on the page of this compound (Q27265641). Scrolling down to the ‘found in taxon’ statement will allow the user to view the biological organisms reported to contain this NP (Figure 3). Clicking the reference link under each taxon name links to the publication(s) documenting the occurrence.

Illustration of the ‘found in taxon’ statement section on the Wikidata page of erysodine Q27265641 showing a selection of erysodine-containing taxa and the references documenting these occurrences.

The typical approach to more elaborate querying involves writing SPARQL queries using the Wikidata Query Service or another direct connection to a SPARQL endpoint. Table 2 contains examples ranging from simple to more elaborate queries, demonstrating what can be done using this approach. The full-text queries with explanations are included in Supplementary file 1.

Table 2
Potential questions about structure-organism relationships and corresponding Wikidata queries.
Question | Wikidata SPARQL query
What are the compounds present in Mouse-ear cress (Arabidopsis thaliana) or its child taxa? | https://w.wiki/4Vcv
Which organisms are known to contain β-sitosterol? | https://w.wiki/4VFn
Which organisms are known to contain stereoisomers of β-sitosterol? | https://w.wiki/4VFq
Which pigments are found in which taxa, according to which reference? | https://w.wiki/4VFx
What are examples of organisms where compounds were found in an organism sharing the same parent taxon, but not in the organism itself? | https://w.wiki/4Wt3
Which Zephyranthes species lack compounds known from at least two species in the genus? | https://w.wiki/4VG3
How many compounds are structurally similar to compounds labeled as antibiotics? (grouped by the parent taxon of the containing organism) | https://w.wiki/4VG4
Which organisms contain indolic scaffolds? Count occurrences, group and order the results by the parent taxon. | https://w.wiki/4VG9
Which compounds with known bioactivities were isolated from Actinobacteria, between 2014 and 2019, with related organisms and references? | https://w.wiki/4VGC
Which compounds labeled as terpenoids were found in Aspergillus species, between 2010 and 2020, with related references? | https://w.wiki/4VGD
Which are the available referenced structure-organism pairs on Wikidata? (example limited to 1000 results) | https://w.wiki/4VFh

The queries presented in Table 2 are only selected examples, and many other ways of interrogating LOTUS can be formulated. Generic queries can be used, for example, for hypothesis generation when starting a research project. For instance, a generic SPARQL query, listed in Table 2 as ‘Which are the available referenced structure-organism pairs on Wikidata?’, retrieves all structures, identified by their InChIKey (P235), that carry ‘found in taxon’ (P703) statements that are in turn ‘stated in’ (P248) a bibliographic reference: https://w.wiki/4VFh. Data can then be exported in various formats, such as classical tabular formats, JSON, or HTML tables (see the Download tab on the lower right of the query frame). At the time of writing (2021-12-20), this query (without the LIMIT 1000) returned 951,800 entries; a frozen query result is available at https://doi.org/10.5281/zenodo.5668854 (Rutz et al., 2021d).
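The full query text is given in Supplementary file 1; as a hedged sketch, the statement pattern just described (P235, P703, P248) can be assembled into a SPARQL string and submitted to the Wikidata Query Service, for example:

```python
from urllib.parse import urlencode

# Minimal sketch of the generic query described above; the authoritative
# full-text query is in Supplementary file 1. Property IDs from the text:
# P235 (InChIKey), P703 (found in taxon), P248 (stated in). On Wikidata,
# references are attached to statements via prov:wasDerivedFrom.
QUERY = """
SELECT ?structure ?inchikey ?taxon ?reference WHERE {
  ?structure wdt:P235 ?inchikey .                       # InChIKey
  ?structure p:P703 ?statement .                        # 'found in taxon' statement node
  ?statement ps:P703 ?taxon .                           # the taxon itself
  ?statement prov:wasDerivedFrom/pr:P248 ?reference .   # bibliographic reference
}
LIMIT 1000
"""

def wdqs_url(query: str) -> str:
    """Build a GET URL returning JSON results from the Wikidata Query Service."""
    return "https://query.wikidata.org/sparql?" + urlencode(
        {"query": query, "format": "json"})
```

Removing the LIMIT clause reproduces the full export mentioned above; results can also be downloaded in tabular form directly from the query service interface.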

Targeted queries interrogating LOTUS data from the perspective of any of the three objects forming the referenced structure-organism pairs can also be built. Users can, for example, retrieve a list of all structures reported from a given organism, such as all structures reported from Arabidopsis thaliana (Q158695) or its child taxa (https://w.wiki/4Vcv). Alternatively, all organisms containing a given chemical structure can be queried, such as in the search for all organisms in which β-sitosterol (Q121802) has been found (https://w.wiki/4VFn). For programmatic access, the lotus-wikidata-exporter repository also allows data retrieval in RDF format and as TSV tables.

To further showcase the possibilities, two additional queries were established (https://w.wiki/4VGC and https://w.wiki/4VGD). Both queries were inspired by recent literature reviews (Jose et al., 2021; Zhao et al., 2022). The first describes compounds found in Actinobacteria, with a biological focus on compounds with reported bioactivity. The second describes compounds found in Aspergillus spp., with a chemical focus on terpenoids. In both cases, the queries retrieve, within seconds, a table similar to the ones compiled in the mentioned reviews. While these queries are not a direct substitute for manual literature review, they do allow researchers to quickly begin such a review process with a very strong body of relevant references.

Molecular electronic resources usually offer query types such as substructure or similarity searches for the convenient expansion or restriction of results. As these query types are not natively supported by SPARQL, they are not readily available for Wikidata exploration. To address this limitation, Galgonek et al. developed an in-house SPARQL engine that makes use of Sachem, a high-performance chemical database cartridge for PostgreSQL enabling fingerprint-guided substructure and similarity searches (Kratochvíl et al., 2018). The engine is used by the Integrated Database of Small Molecules (IDSM), which operates, among other things, several dedicated endpoints allowing structural searches in selected small-molecule datasets via SPARQL (Kratochvíl et al., 2019). To also allow substructure and similarity searches via SPARQL on compounds from Wikidata, a dedicated IDSM/Sachem endpoint was created for the LOTUS project. The endpoint indexes the isomeric (P2017) and canonical (P233) SMILES strings available in Wikidata. To keep the data up to date, SMILES strings are automatically downloaded from Wikidata daily. The endpoint allows users to run federated queries and thereby perform structure-oriented searches on the LOTUS data hosted on Wikidata. For example, the SPARQL query https://w.wiki/4VG9 returns a list of all organisms in which NPs with an indolic scaffold have been found. The output is aggregated at the parent taxon level of the containing organisms and ranked by the number of scaffold occurrences.
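As a hedged illustration, such a federated query combines the Wikidata Query Service with the IDSM/Sachem endpoint; the service IRI and the Sachem vocabulary below are assumptions drawn from public IDSM documentation and may need adjustment against the actual endpoint:

```python
# Sketch of a federated substructure search; both the service IRI and the
# sachem: predicate names are ASSUMPTIONS (check the IDSM documentation for
# the exact endpoint address and vocabulary before use).
INDOLE_SMILES = "C1=CC2=C(C=C1)C=CN2"  # indole scaffold

FEDERATED_QUERY = f"""
SELECT ?compound ?taxon WHERE {{
  SERVICE <https://idsm.elixir-czech.cz/sparql/endpoint/wikidata> {{
    ?compound sachem:substructureSearch
        [ sachem:query "{INDOLE_SMILES}" ] .  # fingerprint-guided matching
  }}
  ?compound wdt:P703 ?taxon .  # organisms reported to contain the matches
}}
"""
```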

The dynamic nature of the Wikidata environment poses some challenges for versioning. However, the evolution of the data can be tracked in multiple ways and at different levels: at the level of Wikidata as a whole, dumps are regularly created (https://dumps.wikimedia.org/wikidatawiki/entities), while at the level of individual entries, the full modification history can be consulted (see, for example, the full edit history of erythromycin: https://www.wikidata.org/w/index.php?title=Q213511&action=history).

We propose a simple approach for users to document, version, and share the output of queries on the LOTUS data at a defined time point. In addition to sharing the short URL of the SPARQL query (which will return results that evolve over time), the returned table can simply be archived on Zenodo or a similar platform. To gather the results of such SPARQL queries, we established the LOTUS Initiative Community repository, to which contributions can be made directly via https://zenodo.org/deposit/new?c=the-lotus-initiative. For example, the output of the Wikidata SPARQL query https://w.wiki/4N8G, executed on 2021-11-10T16:56, can easily be archived and shared in a publication via its DOI: 10.5281/zenodo.5668380.

Lotus.naturalproducts.net (LNPN)

In the search field of the LNPN interface (https://lotus.naturalproducts.net), simple queries can be achieved by typing the molecule name (e.g. ibogaine) or pasting a SMILES, InChI, or InChIKey string, or a Wikidata identifier. All compounds reported from a given organism can be found by entering the organism name at the species or any higher taxonomic level (e.g. Tabernanthe iboga). Compound search by chemical class is also possible.

Alternatively, a structure can be directly drawn in the structure search interface (https://lotus.naturalproducts.net/search/structure), where the user can also decide on the nature of the structure search (exact, similarity, or substructure search). A refined search mode combining multiple search criteria, in particular physicochemical properties, is available in the advanced search interface (https://lotus.naturalproducts.net/search/advanced).

Within LNPN, LOTUS bulk data can be retrieved as SDF or SMILES files, or as a complete MongoDB dump via https://lotus.naturalproducts.net/download. Extensive documentation describing the search possibilities and data entries is available at https://lotus.naturalproducts.net/documentation. LNPN can also be queried via the application programming interface (API) as described in the documentation.

Data addition and evolution

One major advantage of the LOTUS architecture is that every user has the option to contribute to the NP occurrence documentation effort by adding new or editing existing data. As all LOTUS data passes through the SSOT mechanism, reprocessing of previously treated elements is avoided. However, at the moment, the SSOT channels are not open to the public for direct write access, in order to maintain data coherence and allow the SSOT scheme to evolve. For now, users can employ the following approaches to add or modify data in LOTUS.

Sources

LOTUS data management involves regular re-importing of both current and new data sources. New and edited information from these electronic NP resources will be checked against the SSOT. If absent or different, data will be passed through the curation pipeline and subsequently stored in the SSOT. Accordingly, by contributing to external electronic NP resources, any researcher has a means of providing new data for LOTUS, keeping in mind the inevitable delay between data addition and subsequent inclusion into LOTUS.

Wikidata

The currently favored approach to add new data to LOTUS is to create or edit Wikidata entries directly. Newly created or edited data will then be imported into the SSOT. There are several ways to interact with Wikidata which depend on the technical skills of the user and the volume of data to be uploaded/modified.

Pre-requisites

While direct Wikidata upload is possible, contributors are encouraged to use the LOTUS curation pipeline as a preliminary step to strengthen the initial data quality. For this, a specific mode of the LOTUS processor can be called (see Custom mode). The added data will therefore benefit from the curation and validation stages implemented in the LOTUS processing pipeline.

Manual upload

Any researcher interested in reporting NP occurrences can manually add the data directly in Wikidata, without any particular technical knowledge requirement. For this, creating a Wikidata account and following the general object editing guidelines is advised. For the addition of NP-centered objects (i.e. referenced structure-organism pairs), users should refer to the WikiProject Chemistry/Natural products group page.

A tutorial for the manual creation and upload of a referenced structure-organism pair to Wikidata is available in Supplementary file 2.

Batch and automated upload

Through the initial curation process described previously, 750,000+ referenced structure-organism pairs were validated for Wikidata upload. To automate this process, a set of programs was written to automatically process the curated outputs, group references, organisms and compounds, check whether they are already present in Wikidata (using SPARQL and direct Wikidata querying), and insert or update the entities as needed (i.e. upserting). These scripts can be used for future batch uploads of properly curated and referenced structure-organism pairs to Wikidata. Programs for data addition to Wikidata can be found in the lotus-wikidata-interact repository. The following Xtools page offers an overview of the latest activity performed by our NPimporterBot using those programs.

Data editing

Even if data is correct at a given time point, scientific advances can invalidate or require updates to previously uploaded entries. Thus, the possibility to continuously edit the data is desirable and guarantees data quality and sustainability. Community-maintained knowledge bases such as Wikidata encourage such a process, presenting the advantage of allowing both manual and automated correction. Field-specific robots such as SuccuBot, KrBot, Pi_bot and ProteinBoxBot, as well as our NPimporterBot, went through an approval process. These robots are capable of performing thousands of edits without the need for human input, which helps reduce the amount of incorrect data that would otherwise require manual editing. However, manual curation by human experts remains the irreplaceable standard. Users who value this approach and are interested in contributing are invited to follow the manual curation tutorial in Supplementary file 2.

The Scholia platform provides a visual interface to display the links among Wikidata objects such as researchers, topics, species or chemicals. It now provides an interesting way to view the chemical compounds found in a given biological organism (see here for the metabolome view of Eurycoma longifolia). While Scholia does not currently offer a direct editing interface for scientific references, it does allow convenient batch editing via QuickStatements. The adaptation of such a framework to edit the referenced structure-organism pairs of the LOTUS initiative could thus facilitate the capture of future expert curation, especially manual efforts that cannot be replaced by automated scripts.

Data interpretation

To illustrate the nature and dimensions of the LOTUS dataset, selected examples of data interpretation are shown. First, the distribution of chemical structures among four important NP reservoirs, plants, fungi, animals, and bacteria, is presented (Table 3). Then, the distribution of biological organisms according to the number of related chemical structures and, likewise, the distribution of chemical structures across biological organisms are illustrated (Figure 4). Furthermore, the contribution of individual electronic NP resources to LOTUS data is summarized using an UpSet plot, which allows the visualization of intersections between data sets (Figure 5). Across these figures, the two previous examples, β-sitosterol as a chemical structure and Arabidopsis thaliana as a biological organism, are used again because of their well-documented status. Finally, a biologically interpreted chemical tree and a chemically interpreted biological tree are presented (Figures 6 and 7). These examples illustrate the overall chemical and biological coverage of LOTUS by linking family-specific classes of chemical structures to their taxonomic position. Table 3 and Figures 4, 6 and 7 were generated using the frozen data (2021-12-20 timestamp), which is available for download at https://doi.org/10.5281/zenodo.5794106 (Rutz et al., 2021a). Figure 5 required a dataset containing information from closed resources, and the complete data used for its generation is therefore not available for public distribution. For reproducibility, all scripts used for the generation of the figures are available in the lotus-processor repository in the src/4_visualizing directory.

Distribution of chemical structures across reported biological organisms in LOTUS

Table 3 summarizes the distribution of chemical structures and their chemical classes (according to NPClassifier; Kim et al., 2021) across the biological organisms reported in LOTUS. For this, biological organisms were grouped into four artificial taxonomic levels (plants, fungi, animals, and bacteria). These were built by combining the two highest taxonomic levels in the OTL taxonomy, namely the Domain and Kingdom levels. “Plants” corresponded to “Eukaryota_Archaeplastida”, “Fungi” to “Eukaryota_Fungi”, “Animals” to “Eukaryota_Metazoa” and “Bacteria” to “Bacteria_NA”. The category corresponding to “Eukaryota_NA” mainly contained Algae, but also other organisms such as Amoebozoa, and was therefore excluded; it represented less than 1% of all entries. The details of this process are available under src/3_analyzing/structure_taxon_distribution.R. When a chemical structure/class was reported in only one taxonomic grouping, it was counted as ‘specific’.
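The grouping and specificity logic described above can be sketched as follows (a minimal illustration only; the authoritative implementation is src/3_analyzing/structure_taxon_distribution.R):

```python
# Minimal sketch of the grouping and specificity logic, using the
# Domain_Kingdom labels from the OTL taxonomy named in the text.
GROUPS = {
    "Eukaryota_Archaeplastida": "Plants",
    "Eukaryota_Fungi": "Fungi",
    "Eukaryota_Metazoa": "Animals",
    "Bacteria_NA": "Bacteria",
}  # "Eukaryota_NA" (mostly Algae) is excluded, as in the text

def specific_structures(pairs):
    """pairs: iterable of (structure, domain_kingdom) tuples.
    Returns the set of structures reported in exactly one of the four groups."""
    seen = {}
    for structure, dk in pairs:
        group = GROUPS.get(dk)
        if group is None:
            continue  # excluded taxa such as Eukaryota_NA
        seen.setdefault(structure, set()).add(group)
    return {s for s, groups in seen.items() if len(groups) == 1}
```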

Table 3
Distribution and specificity of chemical structures across four important NP reservoirs: plants, fungi, animals, and bacteria.

When a chemical structure/class appeared in only one group and not the three others, it was counted as ‘specific’. Chemical classes were attributed with NPClassifier.

Group | Organisms | 2D structure-organism pairs | 2D chemical structures | Specific 2D chemical structures | Chemical classes | Specific chemical classes
Plantae | 28,439 | 342,891 | 95,191 | 90,672 (95%) | 545 | 59 (11%)
Fungi | 4,003 | 36,950 | 22,594 | 20,194 (89%) | 417 | 19 (5%)
Animalia | 2,716 | 24,114 | 15,242 | 11,822 (78%) | 455 | 14 (3%)
Bacteria | 1,555 | 23,198 | 15,895 | 14,130 (89%) | 385 | 43 (11%)

Distributions of organisms per structure and structures per organism

Readily achievable outcomes from LOTUS show that the depth of exploration of the world of NP is rather limited: as depicted in Figure 4, on average, three organisms are reported per chemical structure and eleven structures per organism. Notably, half of all structures have been reported from a single organism, and half of all studied organisms are reported to contain five or fewer structures. Metabolomics studies suggest that these numbers are heavily underestimated (Noteborn et al., 2000; Wang et al., 2020) and indicate that better reporting of the metabolites detected in the course of NP chemistry investigations should greatly improve coverage.
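Such distributions can be derived directly from the list of referenced pairs; a minimal sketch (the published figures come from the scripts in the lotus-processor repository):

```python
from collections import defaultdict
from statistics import median

def occurrence_stats(pairs):
    """pairs: iterable of (structure, organism) tuples.
    Returns (median organisms per structure, median structures per organism)."""
    orgs_per_structure = defaultdict(set)
    structs_per_org = defaultdict(set)
    for structure, organism in pairs:
        orgs_per_structure[structure].add(organism)
        structs_per_org[organism].add(structure)
    return (median(len(v) for v in orgs_per_structure.values()),
            median(len(v) for v in structs_per_org.values()))
```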

This incomplete coverage may be partially explained by the tendency of classical NP journals to accept only new and/or bioactive chemical structures for publication. Another possible explanation is that specific chemical classes have been under heavier scrutiny by the natural products community than others. For example, alkaloids have three characteristics that favor their reporting in the literature. First, they are often endowed with potent biological activities, making them a prime target of pharmacognosy research. Second, their chemical nature makes them readily accessible from complex biological matrices through acido-basic extraction. Third, they ionize well in positive MS mode, which makes them detectable even at very low concentrations at which other compounds, present at much higher concentrations, remain undetected. It is thus a complex task to answer the following question: “Is the currently observed repartition of alkaloids across the tree of life a reflection of their true biological occurrence, or is this repartition biased by the aforementioned characteristics of this chemical class?” While the LOTUS initiative does not yet allow disentangling the bias from the true occurrence, it should offer sound and strong foundations for such challenging research questions.

Another obvious explanation for the limited coverage (see Figure 4) is the fact that most of the chemical structures in LOTUS have been physically isolated and described. This is an extremely time-consuming effort that obviously cannot be carried out for all metabolites of all biological organisms. Here, the sensitivity of mass spectrometry and the ever-increasing efficiency of computational metabolite annotation solutions could offer powerful leverage. The documentation of metabolite annotation results obtained on large collections of biological matrices, together with the associated metadata, within knowledge graphs offers exciting perspectives for expanding both the chemical and biological coverage of the LOTUS data in a feasible manner.

Distribution of ‘structures per organism’ and ‘organisms per structure’.

The number of organisms linked to the planar structure of β-sitosterol (KZJWDPNRJALLNS) and the number of chemical structures in Arabidopsis thaliana are two exemplary highlights. A. thaliana contains 687 different short InChIKeys (i.e. 2D structures), and KZJWDPNRJALLNS is reported in 3,979 distinct organisms. Less than 10% of the species contain more than 80% of the structural diversity present within LOTUS. In parallel, 80% of the species present in LOTUS are linked to less than 10% of the structures. The figure is available under the CC0 license at https://commons.wikimedia.org/wiki/File:Lotus_initiative_1_structure_organism_distribution.svg.

Contribution of individual electronic NP resources to LOTUS

The added value of the LOTUS initiative to assemble multiple electronic NP resources is illustrated in Figure 5: Panel A shows the contributions of the individual electronic NP resources to the ensemble of chemical structures found in one of the most studied vascular plants, Arabidopsis thaliana (“Mouse-ear cress”; Q147096). Panel B shows the ensemble of taxa reported to contain the planar structure of the widely occurring triterpenoid β-sitosterol (Q121802).

UpSet plots of the individual contribution of electronic NP resources to the planar structures found in Arabidopsis thaliana (A) and to organisms reported to contain the planar structure of β-sitosterol (KZJWDPNRJALLNS) (B).

UpSet plots generalize Venn diagrams, allowing intersections between multiple sets to be represented. The horizontal bars on the lower left represent the number of corresponding entries per electronic NP resource. The dots and their connecting lines indicate which resources form each intersection. The vertical bars indicate the number of entries at the intersection. For example, 479 organisms containing the planar structure of β-sitosterol are present in both UNPD and NAPRALERT, whereas UNPD and NAPRALERT individually report 1,349 and 2,330 such organisms, respectively. The figure is available under the CC0 license at https://commons.wikimedia.org/wiki/File:Lotus_initiative_1_upset_plot.svg.
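The exclusive intersections underlying such a plot can be computed with simple set operations; a minimal sketch using illustrative resource names:

```python
def upset_counts(resources):
    """resources: dict mapping resource name -> set of entries.
    Returns a dict mapping a frozenset of resource names to the number of
    entries found in exactly those resources (the bars of an UpSet plot)."""
    counts = {}
    all_entries = set().union(*resources.values())
    for entry in all_entries:
        membership = frozenset(n for n, s in resources.items() if entry in s)
        counts[membership] = counts.get(membership, 0) + 1
    return counts
```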

Figure 5A also shows that, according to NPClassifier, the distribution of the ‘chemical pathway’ category is not conserved across electronic NP resources. Note that both NPClassifier and ClassyFire (Djoumbou Feunang et al., 2016) chemical classification results are available as metadata in the frozen LOTUS export and in LNPN. Both classification tools return a chemical taxonomy for individual structures, thus allowing their grouping at higher hierarchical levels, in the same way as is done for biological taxonomies. The UpSet plot in Figure 5 indicates the poor overlap of preexisting electronic NP resources and the added value of an aggregated dataset. This is particularly well illustrated in Figure 5B, where the number of organisms in which the planar structure of β-sitosterol (KZJWDPNRJALLNS) has been reported is shown for each intersection. NAPRALERT has by far the highest number of entries (2,330 in total), while other electronic NP resources complement it well: for example, UNPD reports 573 organisms with β-sitosterol that do not overlap with any other resource. Of note, β-sitosterol is documented in only 13 organisms in the DNP, highlighting the importance of more systematic reporting of ubiquitous metabolites and the value of aggregating multiple data sources.

A biologically interpreted chemical tree

The chemical diversity captured in LOTUS is displayed here using TMAP (Figure 6), a visualization library allowing large chemical datasets to be organized structurally as a minimum spanning tree (Probst and Reymond, 2020). Using Faerun, an interactive HTML file is generated to display metadata and molecular structures by embedding the SmilesDrawer library (Probst and Reymond, 2018a; Probst and Reymond, 2018b). Planar structures were used for all compounds to generate the TMAP (chemical space tree-map) using MAP4 encoding (Capecchi et al., 2020). As the tree organizes structures according to their molecular fingerprints, an anticipated coherence between the clustering of compounds and the mapped NPClassifier chemical class is observed (Figure 6A). For clarity, some of the most represented chemical classes of LOTUS, plus quassinoids and stigmastane steroids, are mapped, with examples of a quassinoid (NXZXPYYKGQCDRO, light green star) and a stigmastane steroid (KZJWDPNRJALLNS, dark green diamond) and their corresponding locations in the TMAP.

TMAP visualizations of the chemical diversity present in LOTUS.

Each dot corresponds to a chemical structure. A highly specific quassinoid (NXZXPYYKGQCDRO, light green star) and a ubiquitous stigmastane steroid (KZJWDPNRJALLNS, dark green diamond) are mapped as examples in all visualizations. In panel A, compounds (dots) are colored according to the NPClassifier chemical class they belong to. In panel B, compounds mostly reported in the Simaroubaceae family are highlighted in blue. Finally, in panel C, the compounds are colored according to the specificity score of the chemical classes found in biological organisms. This biological specificity score, at a given taxonomic level for a given chemical class, is calculated as a Jensen-Shannon divergence. A score of 1 suggests that compounds are highly specific, 0 that they are ubiquitous. Zooms on a group of compounds with a high biological specificity score (in pink) and on compounds of low specificity (blue) are depicted. An interactive HTML visualization of the LOTUS TMAP is available at https://lotus.nprod.net/post/lotus-tmap/ and archived at https://doi.org/10.5281/zenodo.5801807 (Rutz and Gaudry, 2021). The figure is available under the CC0 license at https://commons.wikimedia.org/wiki/File:Lotus_initiative_1_biologically_interpreted_chemical_tree.svg.

To explore relationships between chemistry and biology, taxonomic information such as the most reported biological family per chemical compound (Figure 6B) or the biological specificity of chemical classes (Figure 6C) can be mapped onto the TMAP. The biological specificity score at a given taxonomic level for a given chemical class is calculated as a Jensen-Shannon divergence. A score of 1 suggests that compounds are highly specific, 0 that they are ubiquitous. For more details, see 3_analyzing/jensen_shannon_divergence.R. This visualization makes it possible to highlight chemical classes specific to a given taxon, such as the quassinoids in the Simaroubaceae family. In this case, it is striking how well the compounds of a given chemical class (quassinoids, Figure 6A) and the most reported plant family per compound (Simaroubaceae, Figure 6B) overlap. This is also evidenced in Figure 6C, with a Jensen-Shannon divergence of 0.99 at the biological family level for quassinoids. In this plot, it is also possible to identify chemical classes that are widely spread among living organisms, such as the stigmastane steroids, which exhibit a Jensen-Shannon divergence of 0.73 at the biological family level, meaning that the repartition of stigmastane steroids among families is not specific. Figure 7—figure supplement 1 further supports this statement.
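The divergence itself can be computed as follows (a minimal base-2 implementation; the distributions actually compared, i.e. class occurrences across families, follow the R script referenced above):

```python
from math import log2

def jensen_shannon_divergence(p, q):
    """Base-2 Jensen-Shannon divergence of two discrete distributions,
    bounded between 0 (identical) and 1 (fully disjoint)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms
        return sum(ai * log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return (kl(p, m) + kl(q, m)) / 2
```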

A chemically interpreted biological tree

An alternative view of the biological and chemical diversity covered by LOTUS is illustrated in Figure 7. Here, instead of organizing chemical compounds, biological organisms are placed within their taxonomy. To limit bias due to under-reporting in the literature and to keep a reasonable display size, only families with at least 50 reported compounds were included. Organisms were classified according to the OTL taxonomy and structures according to NPClassifier. The tips were labeled according to the biological family and colored according to their biological kingdom. The bars represent the structural specificity of the most characteristic chemical class of the given biological family (the higher, the more specific). This specificity score is a Jaccard index between the chemical class and the biological family. For more details, see 4_visualizing/plot_magicTree.R.
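The Jaccard index underlying this score can be sketched as follows (the exact structure sets compared follow 4_visualizing/plot_magicTree.R; treating both the chemical class and the biological family as sets of structures is an illustrative assumption):

```python
def jaccard_index(class_structures, family_structures):
    """Jaccard index between the set of structures in a chemical class and
    the set of structures reported from a biological family:
    |intersection| / |union|, ranging from 0 (disjoint) to 1 (identical)."""
    a, b = set(class_structures), set(family_structures)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```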

LOTUS provides new means of exploring and representing chemical and biological diversity.

The tree generated from current LOTUS data builds on the biological taxonomy and uses the kingdom as the tip label color (only families containing 50+ chemical structures were considered). The outer bars correspond to the most specific chemical class found in the biological family. The height of the bar is proportional to a specificity score, corresponding to a Jaccard index between the chemical class and the biological family. The bar color corresponds to the chemical pathway of the most specific chemical class in the NPClassifier classification system. The size of the leaf nodes corresponds to the number of genera reported in the family. The figure is vectorized and zoomable for detailed inspection and is available under the CC0 license at https://commons.wikimedia.org/wiki/File:Lotus_initiative_1_chemically_interpreted_biological_tree.svg.

Figure 7 makes it possible to spot highly specific compound classes, such as the trinervitane terpenoids in the Termitidae, the rhizoxin macrolides in the Rhizopodaceae, or the quassinoids and limonoids typical, respectively, of Simaroubaceae and Meliaceae. Similarly, tendencies toward more generic occurrences of NP can be observed. For example, within the fungal kingdom, Basidiomycotina appear to have a higher biosynthetic specificity toward terpenoids than other fungi, which mostly focus on polyketide production. As explained previously, Figure 7 is highly dependent on the data reported in the literature. As also illustrated in Figure 4, some compounds are over-studied across several organisms, while many organisms have been studied for specific compounds only, a direct consequence of the way the NP community currently reports its data. With this in mind, when observed at a finer scale, down to the structure level, such chemotaxonomic representations can give valuable insights. For example, among all chemical structures, only two were found in all biological kingdoms, namely heptadecanoic acid (KEMQGTRYUADPNZ-UHFFFAOYSA-N) and β-carotene (OENHQHLEOONYIE-JLTXGRSLSA-N). Looking at the distribution of β-sitosterol (KZJWDPNRJALLNS-VJSFXXLFSA-N) within the overall biological tree, Figure 7—figure supplement 1 plots its presence/absence versus those of its superior chemical classifications, namely the stigmastane, steroid and terpenoid derivatives, over the same tree used in Figure 7. The comparison of these five chemically interpreted biological trees clearly highlights the increasing speciation of the β-sitosterol biosynthetic pathway in the Archaeplastida kingdom, while the superior classes are distributed across all kingdoms. Figure 7 is zoomable and vectorized for detailed inspection.

As illustrated, the possibility of data interrogation at multiple precision levels, from fully defined chemical structures to broader chemical classes, is of great interest, for example, for taxonomic and evolutionary studies. This makes LOTUS a unique resource for the advancement of chemotaxonomy, a discipline pioneered by Augustin Pyramus de Candolle and pursued by other notable researchers (Robert Hegnauer, Otto R. Gottlieb) (Gottlieb, 1982; Hegnauer, 1986a; Candolle, 1816). Six decades after Hegnauer’s publication of ‘Die Chemotaxonomie der Pflanzen’ (Hegnauer, 1986b), much remains to be done for the advancement of this field of study, and the LOTUS initiative aims to provide a solid basis for researchers willing to pursue these exciting explorations at the interface of chemistry, biology and evolution.

As shown recently in the context of spectral annotation (Dührkop et al., 2021), lowering the precision level of the annotation allows a broader coverage along with greater confidence. Genetic studies investigating the pathways involved and the organisms carrying the responsible biosynthetic genes would be of interest to confirm the previous observations. These forms of data interpretation exemplify the importance of reporting not only new structures, but also novel occurrences of known structures in organisms as comprehensive chemotaxonomic studies are pivotal for a better understanding of the metabolomes of living organisms.

The integration of multiple knowledge sources, for example genetics for NP-producing gene clusters (Kautsar et al., 2020) combined with taxonomies and occurrence databases, also opens new opportunities to understand whether an organism is responsible for the biosynthesis of an NP or merely contains it. This understanding is of utmost importance for the chemotaxonomic field and will help clarify to what extent microorganisms (endosymbionts) play a role in host development and its NP expression potential (Saikkonen et al., 2004).

Conclusion and Perspectives

Advancing natural products knowledge

At its current development stage, data harmonized and curated throughout the LOTUS initiative remains imperfect and, by the very nature of research, at least partially biased (see Introduction). In the context of bioactive NP research, and due to global editorial practices, it should not be ignored that many publications tend to emphasize new compounds and/or those for which interesting bioactivity has been measured. Near-ubiquitous (primarily plant-based) compounds, if broadly bioactive, tend to be overrepresented in the NP literature, yet the implication of their wide distribution in nature and associated patterns of non-specific activity are often underappreciated (Bisson et al., 2016b). Ideally, all characterized compounds independent of structural novelty and/or bioactivity profile should be documented, and the sharing of verified structure-organism pairs is fundamental to the advancement of NP research.

The LOTUS initiative provides a framework for rigorous review and incorporation of new records and already presents a valuable overview of the distribution of NP occurrences studied to date. While the current data presents a reasonable approximation of the chemistries of a few well-studied organisms such as Arabidopsis thaliana, it remains patchy for many other organisms represented in the dataset. Community participation is the most efficient means of achieving better documentation of NP occurrences, and the comprehensive editing opportunities provided within LOTUS and through the associated Wikidata distribution platform open new avenues for such collaborative engagement. In addition to facilitating the introduction of new data, the platform also provides a forum for critical review of existing data (see an example of a Wikidata Talk page here), as well as for harmonization and verification of existing NP datasets as they come online.

Fostering FAIRness and TRUSTworthiness

The harmonization of LOTUS data and its dissemination as referenced structure-organism pairs through Wikidata enable novel forms of queries and transformational perspectives in NP research. As LOTUS follows the guidelines of FAIRness and TRUSTworthiness, all researchers across disciplines can benefit from this opportunity, whether their interest lies in ecology and evolution, chemical ecology, drug discovery, biosynthesis pathway elucidation, chemotaxonomy, or other research fields connected with NP.

Researchers worldwide uniformly acknowledge the limitations caused by the intrinsic unavailability of essential (raw) data (Bisson et al., 2016a). In addition to being FAIR, LOTUS data is also open, with a clear license, whereas closed data remains a major impediment to the advancement of science (Murray-Rust, 2008). The lack of progress in this direction is partly due to elements in the dissemination channels of classical print and static PDF publication formats that complicate, or sometimes even discourage, data sharing, for example through page limitations and economically motivated mechanisms, including those involved in the focus on and calculation of journal impact factors. In particular, raw data such as experimental readings, spectroscopic data, instrumental measurements, and statistical and other calculations are valued by all but disseminated by very few. The immense value of raw data and the desire to advance its public dissemination have recently been documented in detail for nuclear magnetic resonance (NMR) spectroscopic data by a large consortium of NP researchers (McAlpine et al., 2019). However, to generate the vital flow of contributed data, the effort associated with preparing and submitting content to open repositories, as well as data reuse, should be better acknowledged in academia, government, regulatory, and industrial environments (Cousijn et al., 2019; Cousijn et al., 2018; Pierce et al., 2019). The introduction of LOTUS provides a new opportunity to advance the FAIR guiding principles for scientific data management and stewardship (Wilkinson et al., 2016).

Opening new perspectives for spectral data

The possibilities for expansion and future applications of the Wikidata-based LOTUS initiative are significant. For example, properly formatted spectral data (e.g. obtained by MS or NMR) can be linked to the Wikidata entries of the originating chemical compounds. MassBank (Horai et al., 2010) and SPLASH (Wohlgemuth et al., 2010) identifiers are already reported in Wikidata, and this existing information can be used to retrieve MassBank or SPLASH records, for example for Arabidopsis thaliana compounds (https://w.wiki/3PJD). Such possibilities will help bridge experimental results obtained during the early stages of NP research with data that has been reported and formatted in different contexts. This opens exciting perspectives for structural dereplication, NP annotation, and metabolomic analysis. The authors have previously demonstrated that taxonomically informed metabolite annotation is critical for improving the NP annotation process (Rutz et al., 2019). Alternative approaches linking structural annotation to biological organisms have also shown substantial improvements (Hoffmann et al., 2021). In this context, the LOTUS initiative offers new opportunities for linking chemical objects to both their biological occurrences and spectral information and should significantly facilitate such applications.
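As a sketch of how such compound-taxon-spectral links can be queried, the snippet below builds a SPARQL query against Wikidata. The identifiers used (P703 "found in taxon", P6689 "MassBank accession ID", Q158695 for Arabidopsis thaliana) are assumptions that should be verified on Wikidata before use; this is not an official LOTUS query.

```python
# Hypothetical sketch: build a SPARQL query retrieving compounds found in a
# given taxon together with their MassBank accession IDs. The identifiers
# P703 ("found in taxon"), P6689 ("MassBank accession ID"), and Q158695
# (Arabidopsis thaliana) are assumptions to double-check on Wikidata.
def compounds_with_massbank(taxon_qid: str) -> str:
    return f"""
SELECT ?compound ?compoundLabel ?massbank WHERE {{
  ?compound wdt:P703 wd:{taxon_qid} ;
            wdt:P6689 ?massbank .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""

QUERY = compounds_with_massbank("Q158695")
```

Such a query can be pasted into the public Wikidata Query Service for interactive exploration.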

Integrating chemodiversity, biodiversity, and human health

As shown in Figure 7—figure supplement 1, observing the chemical and biological diversity at various granularities offers new insights. Regarding the chemical objects involved, it will be important to document the taxonomies of chemical annotations for the Wikidata entries. However, this is a rather complex task, for which stability and coverage issues will have to be addressed first. Existing chemical taxonomies such as ChEBI, ClassyFire, or NPClassifier are evolving steadily, and it will be important to constantly update the tools used to make further annotations. Promising efforts have been undertaken to automate the inclusion of Wikidata structures into a chemical ontology. Such an approach exploits the associated SMILES and SMARTS properties to infer a chemical classification for the structure; see, for example, the entry related to emericellolide B. Repositioning NP within their greater biosynthetic context is another major challenge and an active field of research. The fact that the LOTUS initiative disseminates its data through Wikidata will facilitate further integration with biological pathway knowledge bases such as WikiPathways and contribute to this complex task (Martens et al., 2021; Slenter et al., 2018).

In the field of ecology, molecular traits are gaining increased attention (Kessler and Kalske, 2018; Sedio, 2017; Taylor and Dunn, 2018). The LOTUS architecture can help associate classical plant traits (e.g. leaf surface area, photosynthetic capacity) with the biological organism entries in Wikidata and thus allow their integration and comparison with the chemicals associated with those organisms. Likewise, the biogeography data documented in repositories such as GBIF could be further exploited in Wikidata to pursue the exciting but understudied topic of ‘chemodiversity hotspots’ (Defossez et al., 2021).

Other NP-related information of great interest remains poorly formatted. One example relates to traditional medicine (and the fields studying it: ethnomedicine and ethnobotany), the historical and empirical approach of humankind to discovering and using bioactive products from Nature, primarily plants. The knowledge generated throughout human history on the use of medicinal substances represents a fascinating yet underutilized body of information. Notably, the literature on the pharmacology and toxicology of NP is compound-centric, steadily growing, and relatively scattered, but still highly relevant for exploring the role and potential utility of NP for human health. To this end, the LOTUS initiative represents a potential framework for new concepts by which such information could be valued and conserved in the digital era (Allard et al., 2018; Cordell, 2017a; Cordell, 2017b). This underscores the transformative value of the LOTUS initiative for the advancement of traditional medicine and its interest for drug discovery in health systems worldwide.

Shortcomings and challenges

Despite these strong advantages, the establishment and operation of the LOTUS curation pipeline is not devoid of flaws, and we list hereafter some of the observed shortcomings and associated challenges.

First, the LOTUS processing pipeline is heavy: it has many dependencies and remains convoluted. We simplified the process and the associated programs as much as possible, but they remain substantial. This is a consequence of the heterogeneous nature of the source information and of the number of successive operations required to process the data.

Second, while the overall objective of the LOTUS processing pipeline is to increase data quality, the pipeline also transforms data along the way, and, in some cases, data quality can be degraded or errors can be propagated. For example, regarding the chemical objects, the processing pipeline performs a systematic sanitization step that includes salt removal, neutralization of charged molecules, and resolution of dimers. We decided to apply this step systematically after observing a high ratio of artifacts among salts, charged molecules, and dimers. This implies that genuinely salified, charged, or dimeric molecules in the input data undergo an unwanted ‘sanitization’ step. In addition, the LOTUS processing step uses external libraries and tools for the automated ‘name to structure’ and ‘structure to name’ translations. These remain challenging, as they rely on sets of predefined rules that do not cover all cases and can commonly lead to incorrect translations.

On the biological organism curation side, we are aware of shortcomings, whether inherent to specific inputs or related to limitations of the general process. Regarding inputs, some cases are clearly not resolvable except through human curation. For example, the word Lotus can refer either to a plant genus of the Fabaceae family (https://www.wikidata.org/wiki/Q3645698) or to the vernacular name of Nelumbo nucifera (https://www.wikidata.org/wiki/Q16528). In fact, the name of the LOTUS initiative comes, in part, from this taxonomic curiosity and the challenge it poses for automated curation. To give another striking illustration, Ficus variegata corresponds both to a plant (https://www.wikidata.org/wiki/Q5446649) and to a mollusc (https://www.wikidata.org/wiki/Q502030). For names coming from traditional Chinese medicine or other sources using vernacular names, translation depended on hand-curated dictionaries, which are clearly not exhaustive. Additionally, it is worth recalling that the validation of the processed entries relies on partly imperfect rules, leading to erroneous entries in the output data. However, we deliberately kept those rules restrictive in order to favor quality over quantity overall (see Figure 2).

Thus, despite our efforts, there is no doubt that incorrect structure-organism pairs have been uploaded to Wikidata (and that some correct ones have not). We expect, however, that the editing facilities offered by the platform and community efforts will improve data quality over time.

Summary and outlook

Despite these challenges, the various facets discussed above connect with ongoing and future developments that the tandem of the LOTUS initiative and its Wikidata integration can accommodate through a broader knowledge base. The information of the LOTUS initiative is already readily accessible to third-party projects built on top of Wikidata, such as the SLING project (https://github.com/ringgaard/sling, see entry for gliotoxin) or the Plant Humanities Lab project (https://lab.plant-humanities.org, see entry for Ilex guayusa in the ‘From Related Items’ section). LOTUS data has also been integrated into PubChem (https://pubchem.ncbi.nlm.nih.gov/source/25132) to complement the natural products-related metadata of this major chemical DB. For an example, see Gentiana lutea.

Behind the scenes, all underlying resources represent data in a multidimensional space and can be extracted as individual graphs, which can then be interconnected. Crafting appropriate federated queries allows users to navigate these graphs and fully exploit their potential (Waagmeester et al., 2020; Kratochvíl et al., 2018). The development of interfaces such as RDFFrames (Mohamed et al., 2020) will also facilitate the use of the wide arsenal of existing machine learning approaches to automate reasoning on these knowledge graphs.
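As a minimal illustration of programmatic access to such graphs, the snippet below prepares (without sending) an HTTP request for the public Wikidata SPARQL endpoint using only the Python standard library; the User-Agent string is a placeholder, and real clients should follow the endpoint's usage policy.

```python
import urllib.parse
import urllib.request

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def build_sparql_request(query: str) -> urllib.request.Request:
    """Prepare a POST request for the Wikidata SPARQL endpoint.
    The request is only built here, not sent."""
    data = urllib.parse.urlencode({"query": query, "format": "json"}).encode()
    return urllib.request.Request(
        WIKIDATA_SPARQL,
        data=data,
        headers={"User-Agent": "lotus-example/0.1 (illustrative placeholder)"},
    )

req = build_sparql_request("SELECT ?s WHERE { ?s ?p ?o } LIMIT 1")
```

Sending the request with `urllib.request.urlopen(req)` would return JSON bindings that can be loaded into any downstream analysis tool.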

Overall, the LOTUS initiative aims to make more and better data available. While we did our best to ensure high data quality, the current processing pipeline still removes many correct entries and misses or introduces some incorrect ones. Aware of these imperfections, we hope our project paves the way for the establishment of an open, durable, and expandable electronic NP resource. The design and efforts of the LOTUS initiative reflect our conviction that the integration of NP research results is long overdue and requires a truly open and FAIR approach to information dissemination, with high-quality data flowing directly from its source to public knowledge bases. We believe that the LOTUS initiative has the potential to fuel a virtuous cycle of research habits and, as a result, contribute to a better understanding of Life and its chemistry.

Materials and methods

Key resources table
| Reagent type (species) or resource | Designation | Source or reference | Identifiers | Additional information |
| --- | --- | --- | --- | --- |
| Software, algorithm | Lotus-processor code | This work (https://github.com/lotusnprod/lotus-processor, Rutz, 2022a) | | Archived at https://doi.org/10.5281/zenodo.5802107 |
| Software, algorithm | Lotus-web code | This work (https://github.com/lotusnprod/lotus-web, Rutz, 2022b) | | Archived at https://doi.org/10.5281/zenodo.5802119 |
| Software, algorithm | Lotus-wikidata-interact code | This work (https://github.com/lotusnprod/lotus-wikidata-interact, Rutz, 2022c) | | Archived at https://doi.org/10.5281/zenodo.5802113 |
| Software, algorithm | Global Names Architecture | https://globalnames.org | QID:Q65691453 | See Additional executable files |
| Software, algorithm | Java | https://www.java.com | QID:Q251 | |
| Software, algorithm | Kotlin | https://kotlinlang.org | QID:Q3816639 | See Kotlin packages |
| Software, algorithm | Manubot | https://manubot.org | QID:Q96473455, RRID:SCR_018553 | Repository available at https://github.com/lotusnprod/lotus-manuscript |
| Software, algorithm | NPClassifier | https://npclassifier.ucsd.edu | | See https://doi.org/10.1021/acs.jnatprod.1c00399 |
| Software, algorithm | OPSIN | https://github.com/dan2097/opsin | QID:Q26481302 | See Additional executable files |
| Software, algorithm | Python Programming Language | https://www.python.org | QID:Q28865, RRID:SCR_008394 | See Python packages |
| Software, algorithm | R Project for Statistical Computing | https://www.r-project.org | QID:Q206904, RRID:SCR_001905 | See R packages |
| Software, algorithm | Molconvert | https://docs.chemaxon.com/display/docs/molconvert.md | QID:Q55377678 | See Chemical structures |
| Software, algorithm | Wikidata | https://www.wikidata.org | QID:Q2013, RRID:SCR_018492 | Project page https://www.wikidata.org/wiki/Wikidata:WikiProject_Chemistry/Natural_products |
| Other | Lotus custom dictionaries | This work | | Archived at https://doi.org/10.5281/zenodo.5801798 |
| Other | Chemical identifier resolver | https://cactus.nci.nih.gov/chemical/structure | | See Chemical structures |
| Other | CrossRef | https://www.crossref.org | QID:Q5188229, RRID:SCR_003217 | See References |
| Other | PubChem | https://pubchem.ncbi.nlm.nih.gov | QID:Q278487, RRID:SCR_004284 | LOTUS data https://pubchem.ncbi.nlm.nih.gov/source/25132 |
| Other | PubMed | https://pubmed.ncbi.nlm.nih.gov | QID:Q180686, RRID:SCR_004846 | See References |
| Other | Taxonomic data sources | https://resolver.globalnames.org/data_sources | | See Translation |
| Other | Natural Products data sources | | | See Appendix 1 |

Data gathering

Request a detailed protocol

Before their inclusion, the overall quality of each source was manually assessed to estimate both the quality of the referenced structure-organism pairs and the absence of ambiguity in the links between data and references. This led to the identification of thirty-six electronic NP resources as valuable LOTUS input. Data from the proprietary Dictionary of Natural Products (DNP v 29.2) was used for comparison purposes only and is not publicly disseminated. FooDB was also curated but not publicly disseminated, since its license proscribes sharing in Wikidata. Appendix 1 gives all necessary details regarding access to and characteristics of the electronic NP resources.

Manual inspection of each electronic NP resource revealed that the structure, organism, and reference fields were widely variable in format and content, thus requiring standardization to be comparable. The initial stage consisted of writing tailored scripts capable of harmonizing and categorizing knowledge from each source (Figure 1). This transformative process led to three categories of fields: those relevant to the chemical structure described, to the producing biological organism, and to the reference documenting the occurrence of the chemical structure in the producing biological organism. This process resulted in categorized columns for each source, providing an initial harmonized format for each table.

For all thirty-eight sources, if a single file or multiple files were accessible via a download option (including FTP), data was gathered that way. For some sources, data was scraped (cf. Appendix 1). All scraping scripts can be found in the lotus-processor repository in the src/1_gathering directory (under each respective subdirectory). Data extraction scripts for the DNP are available and should allow users with a DNP license to further exploit the data (src/1_gathering/db/dnp). The chemical structure fields, organism fields, and reference fields were manually categorized into three, two, and ten subcategories, respectively. For chemical structures, these were “InChI”, “SMILES”, and “chemical name” (not necessarily IUPAC). For organisms, they were “clean” and “dirty”, the latter meaning that text unrelated to the canonical name was present or that the organism was not described by its canonical name (e.g. “Compound isolated from the fresh leaves of Citrus spp.”). For references, the original reference was kept in the “original” field. When the format allowed it, references were divided into “authors”, “doi”, “external”, “isbn”, “journal”, “original”, “publishing details”, “pubmed”, “title”, and “split”. The generic “external” field was used for all external cross-references to other websites or electronic NP resources (e.g. “also in knapsack”). The last subcategory, “split”, corresponds to a field that remains non-atomic after removal of parts of the original reference. Other field titles are self-explanatory. The producing organism field was kept as a single field.
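To illustrate the kind of per-source categorization described above, here is a minimal, hypothetical dispatcher that routes a raw reference string to one of the subcategories; the regular expressions and rules are illustrative only, not the actual lotus-processor logic.

```python
import re

# Hypothetical sketch of reference-field categorization (not the actual
# lotus-processor code): route a raw reference string to one subcategory.
DOI_RE = re.compile(r"\b10\.\d{4,9}/\S+")   # DOI prefix pattern
PMID_RE = re.compile(r"^\d{1,8}$")          # bare PubMed identifier

def categorize_reference(raw: str) -> str:
    raw = raw.strip()
    if DOI_RE.search(raw):
        return "doi"
    if PMID_RE.match(raw):
        return "pubmed"
    # crude heuristic for journal/volume/page strings
    if any(sep in raw for sep in (";", "vol.", "pp.")):
        return "publishing details"
    return "original"
```

In the real pipeline, each source has its own tailored script; this sketch only conveys the general dispatching idea.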

Data harmonization

Request a detailed protocol

To harmonize all previously gathered sources, sixteen columns were chosen as described above. After harmonization of the electronic NP resources, the resulting subcategories were divided and subjected to further processing. The ‘chemical structure’ fields were divided into files according to their subcategories (“InChI”, “names”, and “SMILES”). A file containing all initial structures from all three subcategories was also generated. The same procedure was followed for organisms and references.

Data processing

Request a detailed protocol

To obtain an unambiguously referenced structure-organism pair for Wikidata dissemination, the initial sixteen columns were translated and processed into three fields: the reported structure, the organism canonical name, and the reference. The structure was reported as InChI, together with its SMILES and InChIKey translations. The biological organism was reported as three minimal, necessary, and sufficient fields: the canonical name, the taxonID, and the corresponding taxonomic DB. The reference was reported as four minimal fields (reference title, DOI, PMCID, and PMID), any one being sufficient. For the forthcoming translation processes, automated solutions were used when available. However, for specific cases (common or vernacular names of the biological organisms, Traditional Chinese Medicine (TCM) names, and conversion between digital reference identifiers), no solution existed, thus requiring the use of tailored dictionaries. Their construction is detailed in the Dictionaries section. The initial entries (containing one or multiple producing organisms per structure, with one or multiple accepted names per organism) were processed into 2M+ referenced structure-organism pairs.
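The minimal target fields described above can be sketched as a simple record type; the field names and example values below are illustrative, not the actual LOTUS schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReferencedPair:
    """Illustrative record for one referenced structure-organism pair
    (field names are hypothetical, not the actual LOTUS schema)."""
    # Structure: reported as InChI, with SMILES and InChIKey translations
    inchi: str
    smiles: str
    inchikey: str
    # Organism: canonical name plus its taxonID and taxonomic DB
    organism_name: str
    taxon_id: str
    taxon_db: str
    # Reference: any one of these four fields is sufficient
    title: Optional[str] = None
    doi: Optional[str] = None
    pmcid: Optional[str] = None
    pmid: Optional[str] = None

# Placeholder values only, to show the shape of a record
pair = ReferencedPair(
    inchi="InChI=1S/H2O/h1H2",
    smiles="O",
    inchikey="XLYOFNOQVPJJNP-UHFFFAOYSA-N",
    organism_name="Arabidopsis thaliana",
    taxon_id="3702",
    taxon_db="NCBI",
    doi="10.1000/example",
)
```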

Chemical structures

Request a detailed protocol

To retrieve as much information as possible from the original structure field(s) of each source, the following procedure was followed. Allowed structural fields were divided into two types: structural (InChI, SMILES) or nominal (chemical name, not necessarily IUPAC). If multiple fields were present, structural identifiers were preferred over structure names. Among structural identifiers, when both were present, SMILES was preferred over InChI. InChI were translated to SMILES using the RDKit, 2021 implementation in Python 3.8 (src/2_curating/2_editing/structure/1_translating/inchi.py). They were first converted to ROMol objects, which were then converted to SMILES. When no structural identifier was available, the nominal identifier was translated to InChI first, thanks to OPSIN (Lowe et al., 2011), a fast, open-source, Java-based translation solution. If no translation was obtained, chemical names were then submitted to PUG-REST, the interface for programmatic access to PubChem (Kim et al., 2018; Kim et al., 2015b). If again no translation was obtained, candidates were then submitted to the Chemical Identifier Resolver. Before the translation process, some typical chemical structure-related Greek characters (such as α, ß) were replaced by their textual equivalents (alpha, beta) to obtain better results. All pre-translation steps are included in the preparing_name function and are available in src/r/preparing_name.R.
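The preference order and the Greek-letter replacement can be sketched as follows; this is a simplified Python stand-in for the preparing_name R function and the selection logic, with a deliberately partial character map.

```python
from typing import Optional, Tuple

# Partial Greek-to-text map applied before name-to-structure translation
# (illustrative subset; the real preparing_name rules are more extensive).
GREEK_TO_TEXT = {"α": "alpha", "β": "beta", "ß": "beta", "γ": "gamma", "δ": "delta"}

def prepare_name(name: str) -> str:
    """Replace Greek characters by their textual equivalents."""
    for greek, text in GREEK_TO_TEXT.items():
        name = name.replace(greek, text)
    return name

def pick_structure_field(smiles: Optional[str], inchi: Optional[str],
                         name: Optional[str]) -> Tuple[str, str]:
    """Prefer structural identifiers over names, and SMILES over InChI."""
    if smiles:
        return ("smiles", smiles)
    if inchi:
        return ("inchi", inchi)
    return ("name", prepare_name(name))
```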

The chemical sanitization step sought to standardize the representation of chemical structures coming from different sources. It consisted of three main stages (standardizing, fragment removal, and uncharging) achieved via the MolVS package. The initial standardizer function consists of six stages (RDKit Sanitization, RDKit Hs removal, Metals Disconnection, Normalization, Acids Reionization, and Stereochemistry recalculation) detailed in the molvs documentation. In a second step, the FragmentRemover functionality was applied using a list of SMARTS to detect and remove common counterions and crystallization reagents sometimes occurring in the input DB. Finally, the Uncharger function was employed to neutralize molecules when appropriate.

The Molconvert function of the MarvinSuite (Marvin 20.19, ChemAxon) was used for traditional and IUPAC name translation. When stereochemistry was not fully defined, (+) and (-) symbols were removed from names. All details are available in the following script: src/2_curating/2_editing/structure/4_enriching/naming.R. Chemical classification of all resulting structures was done using classyfireR (Djoumbou Feunang et al., 2016) and the NPClassifier API.

After manual evaluation, structures remaining as dimers were discarded (all structures containing a “.” in their SMILES were removed).

From the 283,267 initial InChI, 242,068 (85%) sanitized structures were obtained, of which 185,929 (77%) had complete stereochemistry defined. A total of 203,718 (72%) were uploaded to Wikidata. From the 248,185 initial SMILES, 207,658 (84%) sanitized structures were obtained, of which 98,685 (48%) had complete stereochemistry defined. 174,091 (70%) were uploaded to Wikidata. From the 49,675 initial chemical names, 27,932 (56%) sanitized structures were obtained, of which 17,460 (63%) had complete stereochemistry defined. 23,036 (46%) were uploaded to Wikidata. In total, 163,800 structures with fully defined stereochemistry were uploaded as “chemical compounds” (Q11173), and 106,669 structures without fully defined stereochemistry were uploaded as “group of stereoisomers” (Q59199015).

Biological organisms

Request a detailed protocol

The processing at the biological organism level had three objectives: converting the original organism string to (a) taxon name(s), atomizing fields containing multiple taxon names, and deduplicating synonyms. The original organism strings were treated with Global Names Finder (GNF) and Global Names Verifier (GNV), both tools coming from the Global Names Architecture (GNA), a system of web services that helps people register, find, index, check, and organize biological scientific names and interconnect online information about species. GNF allows scientific name recognition within raw text blocks and searches for the found scientific names among public taxonomic DB. GNV takes names or lists of names and verifies them against various biodiversity data sources. Canonical names, their taxonID, and the taxonomic DB they were found in were retrieved. When a single entry led to multiple canonical names (accepted synonyms), all of them were kept. Because both GNF and GNV recognize scientific names but not common ones, common names were translated before resubmission.

Dictionaries

Request a detailed protocol

To perform the translation from common biological organism names to Latin scientific names, specialized dictionaries included in DrDuke, FooDB, and PhenolExplorer were aggregated together with the translation dictionary of the GBIF Backbone Taxonomy. The script used for this was src/1_gathering/translation/common.R. When the canonical translation of a common name contained a specific epithet that was not initially present, the translation pair was discarded (for example, “Aloe” translated to “Aloe vera” was discarded). Common names corresponding to a generic name were also discarded (for example, “Kiwi”, corresponding to a synonym of Apteryx spp. (https://www.gbif.org/species/4849989)). When multiple translations were given for a single common name, the following procedure was applied: the canonical name was split into species name, genus name, and possible subnames, and for each common name, genus names and species names were counted. If both the species and genus names were consistent at more than 50%, they were considered consistent overall and therefore kept (for example, “Aberrant Bush Warbler” had “Horornis flavolivaceus” and “Horornis flavolivaceus intricatus” as translations; as both the generic name (“Horornis”) and the specific epithet (“flavolivaceus”) were consistent at 100%, “Horornis flavolivaceus” was kept). When only the generic epithet had more than 50% consistency, it alone was kept (for example, “Angelshark” had “Squatina australis” and “Squatina squatina” as translations, so only “Squatina” was kept). Some unspecific common names were removed (see https://doi.org/10.5281/zenodo.5801816 Rutz, 2021) and only common names with more than three characters were kept. This resulted in 181,891 translation pairs further used for the conversion from common names to scientific names. For TCM names, translation dictionaries from TCMID and TMMC and from the Chinese Medicine Board of Australia were aggregated.
The script used for this was src/1_gathering/translation/tcm.R. Some unspecific common names were removed (see https://doi.org/10.5281/zenodo.5801816 Rutz, 2021). Careful attention was given to Latin genitive translations, and custom dictionaries were written (see https://doi.org/10.5281/zenodo.5801816 Rutz, 2021). Organ names of the producing organism were removed to avoid wrong translations (see https://doi.org/10.5281/zenodo.5801816 Rutz, 2021). This resulted in 7,070 translation pairs. Both common and TCM translation pairs were then ordered by decreasing string length, translating the longer names first to avoid parts of them being translated incorrectly.
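The majority rule used above for ambiguous common-name translations can be sketched like this; it is a simplified reconstruction of the described procedure (subspecific names are simply dropped), not the actual common.R code.

```python
from collections import Counter
from typing import List, Optional

def consistent_translation(candidates: List[str]) -> Optional[str]:
    """Apply the >50% consistency rule to the canonical-name candidates of
    one common name (simplified sketch: subnames are ignored)."""
    n = len(candidates)
    genera = Counter(c.split()[0] for c in candidates)
    binomials = Counter(" ".join(c.split()[:2]) for c in candidates)
    genus, genus_count = genera.most_common(1)[0]
    binomial, binomial_count = binomials.most_common(1)[0]
    if binomial_count * 2 > n:   # genus and specific epithet both consistent
        return binomial
    if genus_count * 2 > n:      # only the genus is consistent
        return genus
    return None
```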

Translation

Request a detailed protocol

To ensure compatibility of the obtained taxonIDs with Wikidata, the taxonomic DB 3 (ITIS), 4 (NCBI), 5 (Index Fungorum), 6 (GRIN Taxonomy for Plants), 8 (The Interim Register of Marine and Nonmarine Genera), 9 (World Register of Marine Species), 11 (GBIF Backbone Taxonomy), 12 (Encyclopedia of Life), 118 (AmphibiaWeb), 128 (ARKive), 132 (ZooBank), 147 (Database of Vascular Plants of Canada (VASCAN)), 148 (Phasmida Species File), 150 (USDA NRCS PLANTS Database), 155 (FishBase), 158 (EUNIS), 163 (IUCN Red List of Threatened Species), 164 (BioLib.cz), 165 (Tropicos - Missouri Botanical Garden), 167 (The International Plant Names Index), 169 (uBio NameBank), 174 (The Mammal Species of The World), 175 (BirdLife International), 179 (Open Tree of Life), 180 (iNaturalist), and 187 (The eBird/Clements Checklist of Birds of the World) were chosen. All other available taxonomic DB are listed at http://index.globalnames.org/datasource. To retrieve as much information as possible from the original organism field of each source, the following procedure was followed. First, a scientific name recognition step, allowing the retrieval of canonical names, was carried out (src/2_curating/2_editing/organisms/subscripts/1_processingOriginal.R). Then, the obtained canonical names were subtracted from the original field to avoid unwanted translation of parts of canonical names. For example, Bromus mango contains “mango” as a specific epithet, which is also the common name for Mangifera indica. After this subtraction step, the remaining names were translated from vernacular (common) and TCM names to scientific names with the help of the dictionaries. For performance reasons, this processing step was written in Kotlin and used coroutines to allow efficient parallelization (src/2_curating/2_editing/organisms/2_translating_organism_kotlin/).
They were subsequently submitted again to scientific name recognition (src/2_curating/2_editing/organisms/3_processingTranslated.R).
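The subtraction step can be sketched as below: removing already-recognized canonical names first (longest first) ensures that, for instance, the epithet of Bromus mango is not later mistranslated as the common name of Mangifera indica. This is a simplified Python stand-in for the actual scripts.

```python
import re
from typing import Iterable

def subtract_canonical(text: str, canonical_names: Iterable[str]) -> str:
    """Remove recognized scientific names from the raw organism string
    before vernacular/TCM translation (illustrative sketch)."""
    # Longest names first, so sub-strings of longer names are not left behind
    for name in sorted(canonical_names, key=len, reverse=True):
        text = re.sub(re.escape(name), " ", text)
    return re.sub(r"\s+", " ", text).strip()
```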

After full resolution of canonical names, all obtained names were submitted to rotl (Michonneau et al., 2016) to obtain a unified taxonomy. From the 88,395 initial “clean” organism fields, 43,936 (50%) canonical names were obtained, of which 32,285 (37%) were uploaded to Wikidata. From the 300 initial “dirty” organism fields, 250 (83%) canonical names were obtained, of which 208 (69%) were uploaded to Wikidata.

Reference

The Rcrossref package (Chamberlain et al., 2020), interfacing with the Crossref API, was used to translate references from their original subcategory (“original”, “publishingDetails”, “split”, “title”) to a DOI, the title of the corresponding article, the journal it was published in, its date of publication, and the name of the first author. The first twenty candidates were kept and ranked according to the score returned by Crossref, a tf-idf score. For DOI and PMID, only a single candidate was kept. All DOIs were also translated with this method, to eventually discard any DOI not leading to an object. PMIDs were translated thanks to the entrez_summary function of the rentrez package (Winter, 2017). Scripts used for all subcategories of references are available in the directory src/2_curating/2_editing/reference/1_translating/. Once all translations were made, the results from each subcategory were integrated (src/2_curating/2_editing/reference/2_integrating.R) and the producing organism related to the reference was added for further treatment. Because the Crossref score alone was not informative enough, at least one other metric was chosen to complement it. The first metric reflected the presence of the producing organism’s generic name in the title of the returned article: if the title contained the generic name of the organism, a score of 1 was given, else 0. For the subcategories “doi”, “pubmed”, and “title”, for which the same subcategory was retrieved via Crossref or rentrez, distances between the input string and the candidates’ strings were calculated, using optimal string alignment (restricted Damerau-Levenshtein distance) as the method. For the “publishing details”, “original”, and “split” categories, three additional metrics were used: if the journal name was present in the original field, a score of 1 was given, else 0; and if the name of the first author was present in the original field, a score of 1 was given, else 0.
Those three scores were then summed together. All candidates were first ordered according to their crossref score, then by the complement score for related subcategories, then again according to their title-producing organism score, and finally according to their translation distance score. After this re-ranking step, only the first candidate was kept. Finally, the Pubmed PMCID dictionary (PMC-ids.csv.gz) was used to perform the translations between DOI, PMID, and PMCID (src/2_curating/2_editing/reference/3_processing.R).
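The re-ranking cascade described above can be expressed as a single composite sort key; the dictionary keys below are hypothetical names for the scores (higher is better, except for the string distance).

```python
from typing import Dict, List

def rerank(candidates: List[Dict]) -> List[Dict]:
    """Order candidate references by Crossref score, then complement score,
    then organism-in-title score, then translation distance (ascending).
    Key names are illustrative, not the actual pipeline's column names."""
    return sorted(
        candidates,
        key=lambda c: (-c["crossref"], -c["complement"],
                       -c["organism_in_title"], c["distance"]),
    )

best = rerank([
    {"id": "a", "crossref": 80.0, "complement": 1, "organism_in_title": 0, "distance": 5},
    {"id": "b", "crossref": 80.0, "complement": 1, "organism_in_title": 1, "distance": 9},
])[0]
```

After this re-ranking, only the first candidate is kept, mirroring the step described above.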

The retention per reference subcategory was as follows:

- “original”: 36,710 initial references; 21,970 (60%) of sufficient quality, of which 15,588 (71%) had the organism name in their title; 14,710 (40% of the initial set) were uploaded to Wikidata.
- “pubmed”: 21,953 initial references; 9,452 (43%) of sufficient quality, of which 6,098 (65%) had the organism name in their title; 5,553 (25%) were uploaded to Wikidata.
- “doi”: 37,371 initial references; 20,139 (54%) of sufficient quality, of which 15,727 (78%) had the organism name in their title; 15,351 (41%) were uploaded to Wikidata.
- “title”: 29,600 initial references; 17,417 (59%) of sufficient quality, of which 12,675 (73%) had the organism name in their title; 10,725 (36%) were uploaded to Wikidata.
- “split”: 11,325 initial references; 5,856 (52%) of sufficient quality, of which 3,206 (55%) had the organism name in their title; 2,854 (25%) were uploaded to Wikidata.
- “publishingDetails”: 3,314 initial references; 119 (4%) of sufficient quality, of which 59 (50%) had the organism name in their title; 58 (2%) were uploaded to Wikidata.

Data realignment

Request a detailed protocol

To restore the referenced structure-organism pair links present in the original data, the processed structures, processed organisms, and processed references were re-aligned with the initial entries. This resulted in 6.2M+ referenced structure-organism pairs. These pairs were not unique, with redundancies among electronic NP resources and different original categories leading to the same final pair (for example, entries reporting InChI=1/C21H20O12/c22-6-13-15(27)17(29)18(30)21(32-13)33-20-16(28)14-11(26)4-8(23)5-12(14)31-19(20)7-1-2-9(24)10(25)3-7/h1-5,13,15,17-18,21-27,29-30H,6H2/t13-,15+,17+,18-,21+/m1/s1 in Crataegus oxyacantha and InChI=1S/C21H20O12/c22-6-13-15(27)17(29)18(30)21(32-13)33-20-16(28)14-11(26)4-8(23)5-12(14)31-19(20)7-1-2-9(24)10(25)3-7/h1-5,13,15,17-18,21-27,29-30H,6H2/t13-,15+,17+,18-,21+/m1/s1 in Crataegus stevenii both led to OVSQVDMCBVZWGM-DTGCRPNFSA-N in Crataegus monogyna). After deduplication, 2M+ unique structure-organism pairs were obtained.
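The deduplication step can be sketched as follows. This minimal Python illustration treats each pair as a dictionary keyed by its sanitized InChIKey, organism, and reference; the actual pipeline derives the InChIKey during structure sanitization.

```python
# Minimal sketch of pair deduplication: distinct source entries that
# resolve to the same (InChIKey, organism, reference) triple are merged.
# Field names are illustrative.

def deduplicate(pairs):
    """Keep one entry per (structure, organism, reference) triple,
    preserving first-seen order."""
    seen = set()
    unique = []
    for pair in pairs:
        key = (pair["inchikey"], pair["organism"], pair["reference"])
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique
```

This is how the two Crataegus entries in the example above, once both sanitized to the same InChIKey, collapse into a single pair per organism.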

After the curation of all three object types, they were reassembled: the original aligned table containing the original pairs was joined with each curation result, and only entries containing a structure, an organism, and a reference after curation were kept. Each curated object was divided into minimal data (for Wikidata upload) and metadata. For each object, a dictionary mapping original to curated values was written, so that these translations need not be repeated during the next curation step (src/2_curating/3_integrating.R).
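The integration step can be sketched as follows, using plain dictionaries in place of the pipeline's R tables. The translation dictionaries map each original value to its curated form; an absent key means curation failed for that value. Names are illustrative.

```python
# Sketch of the integration step: the original aligned table is joined
# with the per-object curation dictionaries, and only rows whose three
# elements all survived curation are kept.

def integrate(original_rows, structures, organisms, references):
    """structures/organisms/references map original values to curated
    values; rows with any failed translation are dropped."""
    curated = []
    for row in original_rows:
        s = structures.get(row["structure"])
        o = organisms.get(row["organism"])
        r = references.get(row["reference"])
        if s and o and r:  # keep complete triples only
            curated.append({"structure": s, "organism": o, "reference": r})
    return curated
```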

Data validation

Request a detailed protocol

The pairs obtained after curation were of variable quality. Globally, structure and organism translation was satisfactory, whereas reference translation was not. To assess the validity of the obtained results, a randomized set of 420 referenced structure-organism pairs was therefore sampled and validated or rejected manually. Entries were sampled with at least 55 from each reference subcategory, to obtain a representative picture of each (src/3_analyzing/1_sampling.R). An entry was validated only if: (i) the structure (as any structural descriptor that could be linked to the final sanitized InChIKey) was described in the reference; (ii) the producing organism (as any organism descriptor that could be linked to the accepted canonical name) was described in the reference; and (iii) the reference described the occurrence of the chemical structure in the biological organism. Results obtained on the manually analyzed set were categorized according to the initial reference subcategory and are detailed in Appendix 2. To improve these results, further processing of the references was needed. Entries were accepted if their reference came from a DOI, a PMID, or a title whose restricted Damerau-Levenshtein distance between the original and translated strings was below ten, or if the reference came from one of the three main journals in which NP occurrences are commonly expected to be published (i.e., Journal of Natural Products, Phytochemistry, or Journal of Agricultural and Food Chemistry). For the “split”, “publishingDetails”, and “original” subcategories, the year of publication, the journal, and the name of the first author of the obtained reference were searched for in the original entry, and the entry was kept if at least two of them were present. Entries were then further filtered to keep those in which the reference title contained the first element of the detected canonical name.
Exceptions to this last filter were made for all DOI-based references, except those from COCONUT. The function implementing those rules is filter_dirty.R. To validate these filtering criteria, an additional set of 100 structure-organism pairs was manually analyzed. The F0.5 score was used as a metric; it is a variant of the F1 score in which precision is given twice the weight of recall. The F-score was calculated with β = 0.5, as in Equation 1:

(1) Fβ = (1 + β²) × (precision × recall) / ((β² × precision) + recall)
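In code, Equation 1 reads as follows (a direct transcription, with a guard for the degenerate zero-precision, zero-recall case):

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score (Equation 1); beta = 0.5 gives precision twice
    the weight of recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For instance, f_beta(0.8, 0.4) evaluates to 2/3: the low recall drags the score down, but less than it would with an F2 score.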

Based on this first manually validated dataset, filtering criteria (src/r/filter_dirty.R) were established to maximize precision and recall. Another 100 entries were then sampled, this time respecting the ratios of the whole dataset. After manual validation, 97% true positives were reached on this second set. A summary of the validation results is given in Appendix 2. Once validated, the filtering criteria were applied to the whole curated set to select the entries for dissemination (src/3_analyzing/2_validating.R).
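The distance-based part of the title-filtering rule can be sketched as follows. The restricted Damerau-Levenshtein distance (optimal string alignment, as computed by the stringdist R package's "osa" method) is implemented here in plain Python; the journal whitelist and threshold are those stated in the text, while the function name and signature are illustrative.

```python
# Optimal string alignment (restricted Damerau-Levenshtein): Levenshtein
# distance extended with transposition of adjacent characters.
def osa_distance(a, b):
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

TRUSTED_JOURNALS = {
    "Journal of Natural Products",
    "Phytochemistry",
    "Journal of Agricultural and Food Chemistry",
}

def keep_title_entry(original_title, translated_title, journal):
    """Accept a title-based entry if the edit distance between the
    original and translated titles is below ten, or if the reference
    comes from one of the three whitelisted journals."""
    close_enough = osa_distance(original_title.lower(),
                                translated_title.lower()) < 10
    return close_enough or journal in TRUSTED_JOURNALS
```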

Unit testing

Request a detailed protocol

To ensure the robustness of the whole process and code, unit tests and partial-data full tests were written. They can be run on the developer's machine as well as on the CI/CD system (GitHub) upon each commit to the codebase.

These tests verify that the functions provide results consistent with expectations, especially for edge cases detected during development. The Kotlin code has tests based on JUnit, and code quality control checks based on Ktlint, Detekt, and Ben Manes' version plugin.

Data dissemination

Wikidata

Request a detailed protocol

All the data produced for this work has been made available on Wikidata under a Creative Commons 0 license, in accordance with Wikidata:Licensing. This is a “no rights reserved” license that places no restrictions on reuse.

Lotus.NaturalProducts.Net (LNPN)

Request a detailed protocol

The web interface is implemented following the same protocol as described in the COCONUT publication (Sorokina et al., 2021a): the data are stored in a MongoDB repository, the backend runs on Kotlin and Java using the Spring framework, the frontend is written in React.js, and the whole is Dockerized. In addition to the diverse search functions available through this web interface, an API is also implemented, allowing programmatic querying of LNPN. Complete API usage is described on the “Documentation” page of the website. LNPN is part of the NaturalProducts.net portal, an initiative aimed at gathering diverse open NP resources in one place.

Data Interaction

Data retrieval

Request a detailed protocol

Bulk retrieval of a frozen (2021-12-20) version of LOTUS data is also available at https://doi.org/10.5281/zenodo.5794106 (Rutz et al., 2021a).

The download lotus part of lotus-wikidata-interact downloads all chemical compounds with a “found in taxon” property. It thereby retrieves not only the data produced by this work, but also any that existed beforehand or that has since been added directly to Wikidata by our users. It copies all the entities (compounds, taxa, references) into a local triplestore that can be queried with SPARQL as-is, or converted to a TSV file for inclusion in other projects. It is currently adapted to export directly into the SSOT, thus allowing direct reuse by the processing/curation pipeline.
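The kind of query the downloader relies on can be sketched as follows, assuming the standard Wikidata properties “found in taxon” (P703), InChIKey (P235), and “stated in” (P248); the actual Kotlin implementation differs in its details. The Wikidata Query Service predefines the p:/ps:/wdt:/prov:/pr: prefixes used here.

```python
# Sketch of a SPARQL query retrieving every chemical entity carrying a
# "found in taxon" (P703) statement, with its InChIKey (P235) and the
# "stated in" (P248) reference backing the statement.

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?compound ?inchikey ?taxon ?reference WHERE {
  ?compound p:P703 ?statement ;
            wdt:P235 ?inchikey .
  ?statement ps:P703 ?taxon ;
             prov:wasDerivedFrom/pr:P248 ?reference .
}
LIMIT 10
"""

def fetch(query=QUERY, endpoint=WIKIDATA_SPARQL):
    """Send the query and return the parsed JSON bindings (requires
    network access, so it is not executed here)."""
    import json, urllib.parse, urllib.request
    url = endpoint + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"})
    req = urllib.request.Request(url, headers={"User-Agent": "lotus-sketch"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["results"]["bindings"]
```

Removing the LIMIT clause turns this into a bulk export of all referenced structure-organism pairs present on Wikidata.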

Data addition

Wikidata

Request a detailed protocol

Data is loaded by the Kotlin importer, available in the upload lotus part of the lotus-wikidata-interact repository under a GPL V3 license, and imported into Wikidata. The importer processes the curated outputs, grouping references, organisms, and compounds together. It then checks whether they already exist in Wikidata (using SPARQL or a direct connection to Wikidata, depending on the kind of data) and updates or inserts (“upserts”) the entities as needed. The script currently takes as input the tabular file of referenced structure-organism pairs resulting from the LOTUS curation process. Before upload, a filtering step is performed to avoid re-uploading entries that were already uploaded; this way, if modifications occur on Wikidata, they will not be erased by the next iteration of the importer. The importer is currently being adapted to use the SSOT directly and avoid an unnecessary conversion step. To import references, it first checks for duplicated DOIs and then uses the Crossref REST API to retrieve the metadata associated with each DOI; support for other citation sources, such as Europe PMC, is in progress. The structure-related fields are subject to only limited processing: basic formatting of the molecular formula by subscripting the numbers. Due to limitations in Wikidata, molecule names are dropped if they are longer than 250 characters, and likewise InChI strings cannot be stored if they are longer than 1,500 characters.

Uploaded taxonomic database identifiers are currently restricted to ITIS, GBIF, NCBI Taxon, Index Fungorum, IRMNG, WoRMS, VASCAN, iNaturalist, and, newly, OTL. The taxon levels are currently limited to family, subfamily, tribe, subtribe, genus, species, and variety. The importer checks for the existence of each item based on its InChIKey and upserts the compound with the “found in taxon” statement and the associated organisms and references.
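The upsert logic can be sketched as follows, modelling Wikidata as a local dictionary keyed by InChIKey. The function names and data shapes are illustrative, not the actual importer API.

```python
# Sketch of the upsert: check whether an item with the compound's
# InChIKey already exists, create it if missing, then add the
# "found in taxon" statement with its reference. Adding to a set makes
# the operation idempotent, so re-runs do not duplicate statements.

def upsert_compound(store, compound):
    """store maps InChIKey -> item; each item carries a set of
    (taxon, reference) statements."""
    item = store.get(compound["inchikey"])
    if item is None:  # insert: the compound is new to the store
        item = {"inchikey": compound["inchikey"], "found_in_taxon": set()}
        store[compound["inchikey"]] = item
    # update: attach the statement and its supporting reference
    item["found_in_taxon"].add((compound["taxon"], compound["reference"]))
    return item
```

The idempotence mirrors the importer's pre-upload filtering: running the same batch twice leaves Wikidata unchanged.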

LNPN

Request a detailed protocol

From the onset, LNPN has imported data directly from the frozen tabular data of the LOTUS dataset (https://doi.org/10.5281/zenodo.5794106; Rutz et al., 2021a). In future versions, LNPN will be fed directly from the SSOT.

Data edition

Request a detailed protocol

The bot framework lotus-wikidata-interact was adapted so that, in addition to its batch upload capabilities, it can also edit erroneously created entries on Wikidata. As massive edits have a large potential to disrupt otherwise good data, this script is deployed progressively: 1, then 10, then 100 entries are edited and manually checked. Once 100 entries have been validated, the full script is run, and its behavior is checked at regular intervals. An example of a corrected entry is available at: https://www.wikidata.org/w/index.php?title=Q105349871&type=revision&diff=1365519277&oldid=1356145998.
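The progressive deployment can be sketched as a simple batching scheme; the manual-check pause between batches is only hinted at here, and the batch sizes match those described above.

```python
# Sketch of progressive deployment: release edits in growing batches
# (1, 10, 100), pausing for manual checking between batches, then
# release the remainder once the sample batches have been validated.

def progressive_batches(edits, sizes=(1, 10, 100)):
    """Yield successive batches of the given sizes, then the rest."""
    start = 0
    for size in sizes:
        batch = edits[start:start + size]
        if batch:
            yield batch  # in practice: apply, then check manually
        start += size
    rest = edits[start:]
    if rest:
        yield rest
```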

Curation interface

Request a detailed protocol

A web-based curation interface (Kotlin with Spring Boot for the backend, and TypeScript with Vue for the frontend) is currently under construction. It will allow mass-editing of entries and quick navigation in the SSOT for the curation of new and existing entries. This new interface is intended to be opened to the public to foster user-driven curation of entries. In line with the overall LOTUS approach, any modification made in this curation interface will, after validation, be mirrored on Wikidata and LNPN.

Code availability

General repository

Request a detailed protocol

All programs written for this work can be found in the following group: https://github.com/lotusnprod.

Processing

Request a detailed protocol

The source data curation system is available at https://github.com/lotusnprod/lotus-processor (copy archived at swh:1:rev:78e6065d8eb9d0b0d11c2ea8de6ac66b445bca0e, Rutz, 2022a). This program takes the source data as input and outputs curated data, ready for dissemination. The first step involves checking whether the source data has already been processed. If not, all three elements (biological organisms, chemical structures, and references) are submitted to various steps of translation and curation before validation for dissemination.

Wikidata

Request a detailed protocol

The programs to interact with Wikidata are available at https://github.com/lotusnprod/lotus-wikidata-interact (copy archived at swh:1:rev:92d19b8995a69f5bba39f438172ba425fdcc0f28, Rutz, 2022c). On the upload side, the program takes the processed data resulting from the lotusProcessor subprocess as input and uploads it to Wikidata. It performs a SPARQL query to check which objects already exist and, if needed, creates the missing objects. It then updates the content of each object. Finally, it updates the chemical compound page with a “found in taxon” statement complemented by a “stated in” reference. A publication importer creating an article page from a DOI is also available.

On the download side, the program takes as input the structured data in Wikidata corresponding to chemical compounds found in taxa with an associated reference, and exports it in both RDF and tabular formats for further use. Two subsequent options are: (a) the end-user directly uses the exported data; or (b) the exported data, which may be new or modified since the last iteration, is used as new source data in lotusProcessor.

LNPN

Request a detailed protocol

The LNPN website and processing system are available at https://github.com/lotusnprod/lotus-web (copy archived at swh:1:rev:278a5ab82389ebd5df720b1876a1724d15937644, Rutz, 2022b). This system takes the processed data resulting from the lotusProcessor as input and uploads it to https://lotus.naturalproducts.net. The repository is not part of the main GitHub group, as it benefits from already established pipelines developed by CS and MS. The website allows searches from different points of view, complemented with taxonomies for both the chemical and biological sides. Many molecular properties and descriptors that are otherwise unavailable in Wikidata are also provided.

Code freezing

Request a detailed protocol

All repository hyperlinks in the manuscript point to the main branches by default. These branches contain all programs and code and will eventually be updated to a publication branch incorporating modifications resulting from the peer-review process. As the code evolves, readers are invited to refer to the main branch of each repository for the most up-to-date code. A frozen version (2021-12-23) of all programs and code is also available in the LOTUS Zenodo community (5802107 (Rutz et al., 2021f), 5802113 (Bisson et al., 2021), and 5802120 (Sorokina et al., 2021b)).

Programs and packages

R

Request a detailed protocol

The R versions used for the project ranged from 4.0.2 to 4.1.2, and the R packages used were, in alphabetical order: ChemmineR (3.42.1) (Cao et al., 2008), chorddiag (0.1.3) (Flor, 2020), ClassyfireR (0.3.6) (Djoumbou Feunang et al., 2016), data.table (1.14.2) (Dowle and Srinivasan, 2020), DBI (1.1.1) (Wickham and Müller, 2021), gdata (2.18.0) (Warnes et al., 2017), ggalluvial (0.12.3) (Brunson, 2020), ggfittext (0.9.1) (Wilkins, 2020), ggnewscale (0.4.5) (Campitelli, 2021), ggraph (2.0.5) (Pedersen, 2020), ggstar (1.0.2) (Xu, 2021), ggtree (3.2.0) (Yu, 2017), ggtreeExtra (1.4.0) (Xu et al., 2021), jsonlite (1.7.2) (Ooms, 2014), pbmcapply (1.5.0) (Kuang et al., 2019), plotly (4.10.0) (Sievert, 2020), rcrossref (1.1.0.99) (Chamberlain et al., 2020), readxl (1.3.1) (Wickham, 2018), rentrez (1.2.3) (Winter, 2017), rotl (3.0.11) (Michonneau et al., 2016), rvest (1.0.2) (Wickham, 2020), splitstackshape (1.4.8) (Mahto, 2019), RSQLite (2.2.8) (Müller et al., 2021), stringdist (0.9.8) (Loo, 2014), stringi (1.7.6) (Gagolewski, 2020), tidyverse (1.3.1) (Wickham et al., 2019), treeio (1.18.0) (Wang et al., 2020), UpSetR (1.4.0) (Gehlenborg, 2019), webchem (1.1.1) (Szöcs et al., 2020), XML (3.99-0.8) (Lang, 2020), xml2 (1.3.3) (Wickham and Hester, 2020).

Python

Request a detailed protocol

The Python version used was 3.7.12 up to 3.9.7, and the Python packages utilized were, in alphabetical order: cmcrameri (1.4) (Crameri, 2021; Crameri et al., 2020), faerun (0.3.20) (Probst and Reymond, 2018a), map4 (1.0) (Capecchi et al., 2020), matplotlib (3.5.0) (Hunter, 2007), Molvs (0.1.1), pandas (1.3.4) (Reback et al., 2020), rdkit (2021.09.2) (RDKit, 2021), scipy (1.7.3) (Virtanen et al., 2020), tmap (1.0.4) (Probst and Reymond, 2020).

Kotlin

Request a detailed protocol

Kotlin packages used were as follows. Common: Kotlin 1.4.21 up to 1.6.0, Univocity 2.9.1, OpenJDK 15, Kotlin serialization 1.3.1, konnector 0.1.34, Log4J 2.14.1. Wikidata importer bot: wdkt 0.12.1, CDK 2.5 (Willighagen et al., 2017), RDF4J 3.7.4, Ktor 1.6.5, KotlinXCli 0.3.3. Wikidata data processing: Shadow 5.0.0. Quality control and testing: Ktlint 10.2.0, Kotlinter 3.3.0, Detekt 1.15.0, Ben Manes' version plugin 0.36.0, JUnit 5.8.1.

Additional executable files

Request a detailed protocol

GNFinder v.0.16.3, GNVerifier v.0.6.1, OPSIN v.2.5.0 (Lowe et al., 2011).

Data availability

Request a detailed protocol

A snapshot of the obtained data at the time of re-submission (2021-12-20) is available in the following Zenodo community: https://zenodo.org/communities/the-lotus-initiative, and related records: 5793224 (Rutz et al., 2021e), 5794107 (Rutz et al., 2021c), 5794597 (Rutz et al., 2021b), 5801816 (Rutz, 2021). The https://lotus.nprod.net website is intended to gather news and features related to the LOTUS initiative in the future.

Appendix 1

Data sources list

Appendix 1—table 1
Data sources list.
Database | Type | Initial retrieved unique entries | Cleaned referenced structure-organism pairs | Pairs validated for Wikidata export | Actual validated pairs on Wikidata | Website | Article | Retrieval | License status | Contact | Dump | Status
afrotrypopen313935554-article (Ibezim et al., 2017)downloadlicense_copyrightFidele Ntie-Kang or Ngozi Justina NwodoYESunmaintained
alkamidopen4,4162,6392,3092,160websitearticle (Boonen et al., 2012)scriptlicense_copyrightBart De SpiegeleerNOmaintained
antibasecommercial46,95645,221-------NOunmaintained
antimarincommercial73,01767,559-------NOunmaintained
biofacquimopen531519519511website (old version)article_old; article_new (Pilón-Jiménez et al., 2019)downloadlicense_CCBY_4.0José Medina-FrancoYESmaintained
biophytmolopen543558322308websitearticle (Sharma et al., 2014)scriptlicense_CCBYAnshu BhardwajNOunmaintained
carotenoiddbopen2,922639530485websitearticle (Yabuzaki, 2017)scriptlicense_copyrightyzjunko@gmail.comNOmaintained
coconutopen5,757,8725,723,691153,981140,877websitearticle (Sorokina and Steinbeck, 2020b)downloadlicense_CCBY_4.0Maria SorokinaYESmaintained
cyanometdbopen1,9051,6311,6211,605-article (Jones et al., 2021)downloadlicense_CCBY_4.0elisabeth.janssen@eawag.chYESmaintained
datawarrioropen5895417160websitearticle (Sander et al., 2015)downloadno_licensethomas.sander@idorsia.comYESretired
dianatdbopen290323115111websitearticle (Madariaga-Mazón et al., 2021)downloadlicense_CCBY_NCamadariaga@iquimica.unam.mx or kmtzm@unam.mxYESmaintained
dnpcommercial205,072254,573--website-script-support@taylorfrancis.comNOmaintained
drdukeopen90,6759,6606,1845,222website-downloadlicense_CC0agref@usda.govYESmaintained
foodbrestricted81,94139,662--website-downloadlicense_CCBY_NCjreid3@ualberta.ca (Jennifer)YESunmaintained
inflamnatopen665632306268-article (Zhang et al., 2019)downloadlicense_copyrightxiaoweilie@ynu.edu.cnYESunmaintained
knapsackopen132,127139,33659,94555,186websitearticle (Shinbo et al., 2006)scriptlicense_copyrightskanaya@gtc.naist.jpNOmaintained
metabolightsopen38,20837,7046,2415,687websitearticle (Haug et al., 2020)downloadlicense_copyright-YESmaintained
mibigopen1,3101,139638535websitearticle (Kautsar et al., 2020)downloadlicense_CCBY_4.0Tilmann Weber orMarnix MedemaYESunmaintained
mitishambaopen1,071534294291websitearticle (Derese et al., 2019)scriptlicense_copyright-NOdefunct
nanpdbopen5,7526,3835,9375,283websitearticle (Ntie-Kang et al., 2017)scriptlicense_copyrightntiekfidele@gmail.com stefan.guenther@pharmazie.uni-freiburg.deNOmaintained
napralertcommercial681,401392,498294,818270,743websitearticle (Graham and Farnsworth, 2010)-license_copyrightnapralert@uic.eduNOdefunct
npassopen290,53530,18525,42923,612websitearticle (Zeng et al., 2018)downloadlicense_CCBY_NCphacyz@nus.edu.sg jiangyy@sz.tsinghua.edu.cn iaochen@163.comYESunmaintained
npatlasopen32,53934,72634,54833,087websitearticle (van Santen et al., 2019; )downloadlicense_CCBY_4.0rliningt@sfu.caYESmaintained
npcareopen7,7635,8783,7903,538websitearticle (Choi et al., 2017)downloadlicense_CCBY_4.0choihwanho@gmail.comYESunmaintained
npediaopen82992828websitearticle (Tomiki et al., 2006)scriptno_licensehisyo@riken.jp npd@riken.jpNOdefunct
nubbeopen2,1892,3402,3402,119websitearticle (Pilon et al., 2017)-license_copyrightVanderlan S. BolzaniNOmaintained
pamdbopen3,0462,8202424websitearticle (Huang et al., 2018)downloadlicense_CCBY_NCawilks@rx.umaryland.edu aoglesby@rx.umaryland.edu mkane@rx.umaryland.eduYESunmaintained
phenolexploreropen8,0778,7007,1235,721websitearticle (Rothwell et al., 2013)downloadlicense_copyrightscalberta@iarc.frYESunmaintained
phytohubopen2,3491,14513294websitearticle (Giacomoni et al., 2017)scriptno_licenseclaudine.manach@inra.frYESunmaintained
procardbopen6,5566,2786055websitearticle (Nupur et al., 2016)scriptlicense_CCBY_4.0Anil Kumar PinnakaAshwani KumarNOunmaintained
respectopen2,7591,064634547websitearticle (Sawada et al., 2012)downloadlicense_CCBY_NC_2.1_Japanksaito@psc.riken.jpYESunmaintained
sancdbopen860925747732websitearticle (Hatherley et al., 2015)scriptlicense_CCBY_4.0Özlem Tastan BishopNOunmaintained
streptomedbopen71,63833,21720,71518,395websitearticle (Klementz et al., 2016)downloadlicense_copyrightstefan.guenther@pharmazie.uni-freiburg.deYESmaintained
swmdopen1,0751,7511,5971,479websitearticle (Davis and Vasanthi, 2011)scriptlicense_CCBY_4.0Dicky.John@gmail.comNOunmaintained
tmdbopen2,1165332624websitearticle (Yue et al., 2014)scriptlicense_copyrightXiao-Chun WanGuan-Hu BaoNOunmaintained
tmmcopen15,0337,8335,8264,015websitearticle (Kim et al., 2015a)downloadlicense_copyrightJeong-Ju LeeYESunmaintained
tpptopen27,18223,872684641websitearticle (Günthardt et al., 2018)downloadlicense_copyrightthomas.bucheli@agroscope.admin.chYESunmaintained
unpdopen331,242304,683211,158197,710websitearticle (Gu et al., 2013)-license_CCBY_4.0lirongc@pku.edu.cn xiaojxu@pku.edu.cnNOdefunct
wakankensakuopen367224208202website-script--NOdefunct
Wikidataopen951,268960,611959,747919,752website-downloadlicense_CC0-YESmaintained

Appendix 2

Summary of the validation statistics

Appendix 2—table 1
Summary of the Validation Statistics.
The first nine data columns refer to the first validation dataset (n = 420); the final two (true positive, false negative) refer to the second validation dataset (n = 100).
Reference type | True positive | False positive | False negative | True negative | Ratio | Precision | Recall | F0.5 score | True positive | False negative
Original | 80 | 6 | 7 | 11 | 0.31 | 0.93 | 0.92 | 0.92 | 38 | 1
Pubmed | 37 | 1 | 5 | 6 | 0.30 | 0.97 | 0.88 | 0.92 | 5 | 1
DOI | 115 | 6 | 0 | 6 | 0.19 | 0.95 | 1.00 | 0.97 | 43 | 1
Title | 38 | 2 | 0 | 16 | 0.12 | 0.95 | 1.00 | 0.97 | 7 | 0
Split | 8 | 0 | 15 | 27 | 0.08 | 1.00 | 0.35 | 0.52 | 4 | 0
Publishing details | 1 | 0 | 1 | 3 | 0.01 | 1.00 | 0.50 | 0.67 | 0 | 0
Total | 279 | 15 | 28 | 98 | 1.00 | - | - | - | 97 | 3
Corrected total | - | - | - | - | - | 0.96 | 0.89 | 0.91 | - | -

Data availability

A snapshot of the obtained data at the time of re-submission (2021-12-20) is available at the following Zenodo community: https://zenodo.org/communities/the-lotus-initiative and related records: https://zenodo.org/record/5793224, https://zenodo.org/record/5794107, https://zenodo.org/record/5794597 and https://zenodo.org/record/5801816. The https://lotus.nprod.net website is intended to gather news and features related to the LOTUS initiative in the future.

References

  1. Book
    1. Blomqvist E
    2. Hose K
    3. Paulheim H
    4. Ławrynowicz A
    5. Ciravegna F
    6. Hartig O
    (2017)
    The Semantic Web: ESWC 2017 Satellite Events
    Cham: Springer.
  2. Website
    1. GBIF
    (2020) GBIF
    Accessed December 9, 2021.
    1. Lee CJ
    2. Sugimoto CR
    3. Zhang G
    4. Cronin B
    (2013) Bias in peer review
    Journal of the American Society for Information Science and Technology 64:2–17.
    https://doi.org/10.1002/asi.22784
  3. Software
    1. RDKit
    (2021) RDKit: Open-source cheminformatics
    GitHub/SourceForge.

Decision letter

  1. David A Donoso
    Reviewing Editor; Escuela Politécnica Nacional, Ecuador
  2. Anna Akhmanova
    Senior Editor; Utrecht University, Netherlands
  3. Charles Tapley Hoyt
    Reviewer

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "The LOTUS Initiative for Open Natural Products Research: Knowledge Management through Wikidata" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Anna Akhmanova as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Charles Tapley Hoyt (Reviewer #1).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this letter to help you prepare a revised submission. Please note that you have received detailed comments from three reviewers. While all of them see potential in your manuscript, we encourage you to provide a meaningful revision and a detailed point-by-point response to the criticism.

Essential revisions:

1) Reviewers would like to see the authors clarify the scope of their work and address to what extent this database will provide the community with a comprehensive database of structure organisms pairs, that is actually useable to answer research questions. This could be achieved by adding concrete examples of how these data could be used to help readers/reviewers understand the scope.

2) Reviewers think the manuscript provides missing or incomplete documentation on the LOTUS processes and software, especially in terms of reproducibility of results. Providing open access to source code and data is an important first step, and additional (challenging!) steps are needed to help others independently build, and use, the provided software/data.

3) The authors have yet to provide compelling evidence of how their continuously managed (and updated) resources can be reliably cited in scholarly literature and incorporated in scientific workflows. In order to study and reference Wikidata, a versioned copy needs to be provided: Wikidata is updated constantly, and these constant stream of changes make it hard for others to verify results extracted from some older version of the Wikidata corpus unless a versioned copy is provided.

Reviewer #1 (Recommendations for the authors):

“A third fundamental element of a structure-organism pair is a reference to the experimental evidence that establishes the linkages between a chemical structure and a biologicl organism and a future-oriented electronic NP resource should contain only fully-referenced structure-organism pairs.”

Typo in "biologicl"

“Currently, no open, cross-kingdom, comprehensive, computer-interpretable electronic NP resource links NP and their producing organisms, along with referral to the underlying experimental work”.

missing "that"

"KNApSAck currently contains 50,000+ structures and 100,000+ structure-organism pairs. However, the organism field is not standardized and access to the data is not straightforward".

This is the first opportunity in the manuscript to describe in more detail the perils of previous databases, especially the mess that is KNApSAck, which you have no choice but to work on because of its ubiquity.

“NAPRALERT is not an open platform, employing an access model that provides only limited free searches of the dataset”.

There's an awful lot of praise for this database given that it is directly antithetical to the manuscript. Please provide further commentary contrasting the work in NAPRALERT (particularly, about its shortcomings as a closed resource) to LOTUS.

“FAIR and TRUST (Transparency, Responsibility, User focus, Sustainability and Technology) principles”.

If you really want to go down the buzzword bingo, I'd suggest making a table or going in depth into each point. These acronyms are, in my opinion, effectively meaningless from a technical perspective, so it falls on the authors of the paper who want to use them (as you are no doubt pressured to do in the modern publishing landscape) to define them and qualify their relevance to your more practical goals.

“…any researcher to contribute, edit and reuse the data with a clear and open CC0 license (Creative Commons 0).”

This is a really interesting point considering you have taken information from several unlicensed databases and several that have more permissive licenses. How do you justify this?

“The SSOT approach consists of a PostgreSQL DB that structures links and data schemes such that every data element has a single place”.

Why is Wikidata not the single source of truth? This means that the curation and generation of this dataset can never really be decentralized- it will always have to be maintained by someone who is the maintainer of the PostgreSQL database. What are the pros/cons for this?

“The LOTUS processing pipeline is tailored to efficiently include and diffuse novel or curated data directly from new sources or at the Wikidata level.”

This sentence is confusing

“All stages of the workflow are described on the git sites of the LOTUS initiative at https://gitlab.com/lotus7 and https://github.com/mSorok/LOTUSweb

Later this became a problem for code review. Why is the code all in different places? There are "organizations" both on GitLab and GitHub to keep related code together. Further, putting code in a personal user's namespace makes it difficult for potential community involvement, especially if the user becomes inactive. Also a matter of opinion, but most science is being done on GitHub. Using GitLab will likely limit the number of people who will interact with the repository. Highly suggested to move it to GitHub.

“All necessary scripts for data gathering and harmonization can be found in the lotus-processor repository in the src/1_gathering directory."

The reference to "SI 1 Data Sources List" does not include the actual license information in the table, but rather links to READMEs (which may not stay stable for the life of the manuscript). This should explicitly state which license each database has. Further, it seems a bit disingenuous since some of these links point to README pages that state there is no license given, such as for Biofacquim (and several others)

https://gitlab.com/lotus7/lotus-processor/-/blob/main/docs/licenses/biofacquim.md

How can you justify taking this content and redistributing it under a more permissive license?

Further, this table should contain versioning information for each database and a flag as to whether it is still being maintained, whether data can be accessed as a dump, and if there is a dump, if it is structured. Right now, the retrieval column is not sufficient for describing how the data is actually procured.

Can you cross-reference these databases to Wikidata pages? FAIRSharing pages? I'm sure following publication in the near- or mid-term, more of these databases will go down permanently, so there should be as much information about what they were available with this publication.

Small suggestion: right align all numbers in tables.

“This process allowed us to establish rules for automatic filtering and validation of the entries. The filtering was then applied to all entries.”

Please explain this process in detail (and also link to the exact code that does it, as you have done in other sections).

“Table 1: Example of a referenced structure-organism pair before and after curation”.

I would like some discussion of the importance of anatomical region in addition to organism. Obviously, where within an organism an NP is produced is important. This is highly granular and makes the problem more difficult, but even if you don't tackle it, it has to be mentioned.

Further, why is the organism not standardized to a database identifier? And why are the prefixes for these ID spaces not mentioned (InChIKey for structures, DOI for references, and what for organisms)? One possible confounder is that there are many taxonomic databases for organisms with varying coverage; this should be addressed. On a second pass through the manuscript, I noticed this is described a bit more in the methods section, so the two should be cross-linked. Even by the end of the methods section, however, it is not clear which identifiers are used in the end. Are Wikidata Q identifiers the single unifying identifier?

“Figure 2: Alluvial plot of the data transformation flow within LOTUS during the automated curation and validation processes.”

Looks like more things are rejected than kept. What about having a human curator in the loop? Or should poorly curated material be thrown away forever? There was a small mention of building a human curation system, but I don't think it had the same focus.

“The figure highlights, for example, the essential contribution of the DOI category of references contained in NAPRALERT towards the current set of validated references in LOTUS”.

Is this to say LOTUS is constructed with the closed data in NAPRALERT? Or was the small portion that was made open sufficient?

“…any researcher has a means of providing new data for LOTUS, keeping in mind the inevitable delay between data addition and subsequent inclusion into LOTUS.”

How likely do you think users will be to contribute their data to LOTUS, when there are many competing interests, like publishing their own database on their own website, which will support their own publication (and then likely go into obscurity)?

Ultimately, this means there should be a long-term commitment from the LOTUS initiative to continue to find new databases as they're published, to do outreach to authors (if possible/necessary), and to incorporate the new databases into the LOTUS pipeline as a third party. This should be described in detail – which organizations will do this, and how will they fund it?

“For this, biological organisms were grouped into four artificial taxonomic levels (plants, fungi, animals and bacteria).”

How? Please be more specific about the methodology, including code/dependencies.

“Figure 6: TMAP visualizations of the chemical diversity present in LOTUS. Each dot”.

This color scheme is hard to interpret. Please consider using figure 6 in https://www.nature.com/articles/s41467-020-19160-7 to help pick a better color scheme.

“The biological specificity score at a given taxonomic level for a given chemical class is calculated as the number of structure-organism pairs within the taxon where the chemical class occurs the most, divided by the total number of pairs.”

Have you considered the Kullback-Leibler divergence (https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) which is often used in text mining to compare the likelihood of a query in a given subset of a full corpus? I think this would also be appropriate for this setting.
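As a concrete (toy) illustration of this suggested alternative, the KL divergence between a taxon's chemical-class distribution and the full-corpus distribution would quantify how unusual that taxon's class profile is. The class names and frequencies below are made up for the sketch:

```python
from math import log

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as dicts mapping
    chemical classes to probabilities. Assumes q[k] > 0 wherever p[k] > 0."""
    return sum(pk * log(pk / q[k]) for k, pk in p.items() if pk > 0)

# Hypothetical class frequencies: one taxon's subset vs the full corpus.
corpus = {"alkaloids": 0.25, "terpenoids": 0.50, "flavonoids": 0.25}
taxon  = {"alkaloids": 0.70, "terpenoids": 0.20, "flavonoids": 0.10}

# Higher divergence = the taxon's class profile deviates more from the corpus.
specificity = kl_divergence(taxon, corpus)
```

A taxon whose class distribution matches the corpus exactly would score zero, giving a natural baseline that the ratio-based score in the manuscript lacks.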

“This specificity score was calculated as in Equation 2:”

This looks very similar to the classic Jaccard index (https://en.wikipedia.org/wiki/Jaccard_index; which is skewed by large differences in the sizes of the two sets) and the overlap coefficient (https://en.wikipedia.org/wiki/Overlap_coefficient). I think the multiplication of the set sizes will have the same issue as the Jaccard index. What is the justification for using this non-standard formulation? Please comment on whether the conclusions change when using the overlap coefficient.
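A toy illustration of the size-imbalance issue (the sets are made up): when one set is entirely contained in a much larger one, the Jaccard index is dragged toward zero, while the overlap coefficient still reports the complete containment.

```python
def jaccard(a, b):
    """Jaccard index: intersection over union."""
    return len(a & b) / len(a | b)

def overlap(a, b):
    """Overlap coefficient: intersection over the smaller set's size,
    which makes it insensitive to a large size imbalance."""
    return len(a & b) / min(len(a), len(b))

# Hypothetical toy example: a small chemical class fully contained
# in a much larger taxon's compound set.
small = {"c1", "c2", "c3"}
large = {"c%d" % i for i in range(1, 101)}  # c1..c100

j = jaccard(small, large)  # 3/100: dragged down by the size gap
o = overlap(small, large)  # 3/3: complete containment is visible
```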

Since this supports Figure 7, a new way to explore chemical and biological diversity, and you previously wrote about the huge skew in the organisms and taxa for which information is available, it would be important to comment on the possible bias in this figure. Were you trying to make the point that the NPClassifier system helps reduce this bias? If so, it was unclear.

“As LOTUS follows the guidelines of FAIRness and TRUSTworthiness, all researchers across disciplines can benefit from this opportunity, whether the interest is in ecology and evolution, chemical ecology, drug discovery, biosynthesis pathway elucidation, chemotaxonomy, or other research fields that connect with NP.”

This was emphatically not qualified within the manuscript. The expansions of FAIR and TRUST were given, but with no explanation of what they mean. Given that these are buzzwords that carry very little meaning on their own, it would be necessary either to re-motivate their importance in the introduction when you mention them and then refer back to them throughout, or to skip them. That being said, I see the need for authors to use buzzwords like this to interest a large audience (who themselves do not necessarily understand the purpose of the concepts).

“This project paves the way for the establishment of an open and expandable electronic NP resource.”

By the end of the conclusion, there was no discussion of some of the drawbacks and roadblocks that might get in the way of the success of your initiative (or broadly how you would define success). I mentioned a few times already in this review places where there were meaningful discussions that were overlooked. This would be a good place to reinvestigate those.

Methods

I'm a huge opponent of the methods section coming after the results. There were a few places where the cross-linking worked well, but when reading the manuscript from top to bottom, I had many questions that made it difficult to proceed. The alternative might not be much better, so I understand that this likely can't be addressed.

“Before their inclusion, the overall quality of the source was manually assessed to estimate, both, the quality of referenced structure-organism pairs and the lack of ambiguities in the links between data and references.”

How was quality assessed? Do you have a flowchart describing the process?

“Traditional Chinese Medicine (TCM) names, and conversion between digital reference identifiers, no solution existed, thus requiring the use of tailored dictionaries.”

How were these made? Where are the artifacts? Are they the same as the artifacts linked in the later section? If so, please cross-link to them.

“All pre-translation steps are included in the preparing_name function and are available in src/r/preparing_name.R.”

For this whole section: why were no NER (named entity recognition) methods applied?

“When stereochemistry was not fully defined, (+) and (-) symbols were removed from names.”

Given the domain, this seems like a huge loss of information. Can you reflect on the potential negative impacts of this? This also comes back to my other suggestion to better motivate the downstream work that uses databases like LOTUS – how could it impact one of those specific examples?

“After full resolution of canonical names, all obtained names were submitted to rotl (Michonneau et al., 2016)”.

Missing reference.

Overall

Thanks for writing such a nice manuscript. I very much enjoyed reading it.

In direct conflict with all of the other things that could improve this manuscript, it's also quite long already. Because it reads easily, this wasn't too much of a problem, but use your best judgement if/when making updates.

Reviewer #2 (Recommendations for the authors):

Overall well written with few errors.

page 14:

"Targeted queries allowing to interrogate LOTUS data from the perspective of one of the three objects forming the referenced structure-organism pairs can be also built."

Should be "can also be built"

Some of the figures after page 68 appear to be corrupted.

Reviewer #3 (Recommendations for the authors):

Thank you for submitting your manuscript. Your approach to re-using an existing platform (e.g., Wikidata) in combination with carefully designed integration and harmonization workflows has great potential to help make better use of existing openly available datasets.

However, I was unable to complete my review due to my inability to reproduce (even partially) the described LOTUS workflows. I did, however, appreciate the attempt that the authors made to work towards meeting the eLife data access policies:

from https://submit.elifesciences.org/html/eLife_author_instructions.html#policies accessed on 5 Aug 2021 :

"[…] Availability of Data, Software, and Research Materials

Data, methods used in the analysis, and materials used to conduct the research must be clearly and precisely documented, and be maximally available to any researcher for purposes of reproducing the results or replicating the procedure.

Regardless of whether authors use original data or are reusing data available from public repositories, they must provide program code, scripts for statistical packages, and other documentation sufficient to allow an informed researcher to precisely reproduce all published results. […]"

I've included my review notes below for your consideration and I am looking forward to your comments.

re: "LOTUS employs a Single Source of Truth (SSOT, Single_source_of_truth) to ensure data reliability and continuous availability of the latest curated version of LOTUS data in both Wikidata and LNPN (Figure 1, stage 4). The SSOT approach consists of a PostgreSQL DB that structures links and data schemes such that every data element has a single place."

Single source of truth is advocated by authors, but not explicitly referenced in statements upserted into Wikidata. How can you trace the origin of LOTUS Wikidata statements to a specific version of the SSOT LOTUS postgres db?

re: Figure 1: Blueprint of the LOTUS initiative.

Data cleaning is a subjective statement: what may be "clean" data to some, may be considered incomplete or incorrectly transformed by others. Suggest to use less subjective statement like "process", "filter", "transform", or "translate".

re: "The contacts of the electronic NP resources not explicitly licensed as open were individually reached for permission to access and reuse data. A detailed list of data sources and related information is available as SI-1."

The provided dataset references offer no mechanism for verifying that the referenced data was in fact the data used in the LOTUS initiative. A method is provided for versioning the various data products across different stages (e.g., external, interim, processed, validation), but the README was insufficient for accessing these data (for example, see below).

re: "All necessary scripts for data gathering and harmonization can be found in the lotus-processor repository in the src/1_gathering directory. All subsequent and future iterations that include additional data sources, either updated information from the same data sources or new data, will involve a comparison of the new with previously gathered data at the SSOT level to ensure that the data is only curated once."

re: "These tests allow a continuous revalidation of any change made to the code, ensuring that corrected errors will not reappear."

How do you continuously validate the code? Periodically, or only when changes occur? In my experience, claims of continuous validation and testing are supported by active continuous integration / automated testing loops. However, I was unable to find links to continuous testing logs on platforms like travis-ci.org, GitHub Actions, or similar.

re: "LOTUS uses Wikidata as a repository for referenced structure-organism pairs, as this allows documented research data to be integrated with a large, pre-existing and extensible body of chemical and biological knowledge. The dynamic nature of Wikidata fosters the continuous curation of deposited data through the user community. Independence from individual and institutional funding represents another major advantage of Wikidata."

How is Wikidata independent from individual and institutional funding? Assuming that it takes major funding to keep Wikidata up and running, what other funding source is used by Wikidata beyond individual donations and institutional grants/support?

re: "The openness of Wikidata also offers unprecedented opportunities for community curation, which will support, if not guarantee, a dynamic and evolving data repository."

Can you provide an example in which community curation happened? Did any non-LOTUS contributor curate data provided by the LOTUS initiative?

"As the taxonomy of living organisms is a complex and constantly evolving field, all the taxon identifiers from all accepted taxonomic DB for a given taxon name were kept. Initiatives such as the Open Tree of Life (OTL) (Rees and Cranston, 2017) will help to gradually reduce these discrepancies, the Wikidata platform should support such developments."

Discrepancies, conflicts, taxonomic revisions, and disagreements are common between taxonomies due to scientific differences, misaligned update schedules, and transcription errors. However, Wikidata does not support organized dissent, instead forcing a single (artificial) taxonomic view onto commonly used names. In my mind, this oversimplifies taxonomic realities, favors one-size-fits-none taxonomies, and leaves behind the specialized taxonomic authorities that are less tech savvy, but highly accurate in their systematics. How do you imagine dealing with conflicting, or outdated, taxonomic interpretations related to published names?

"The currently favored approach to add new data to LOTUS is to edit Wikidata entries directly. Newly edited data will then be imported into the SSOT repository. There are several ways to interact with Wikidata which depend on the technical skills of the user and the volume of data to be imported/modified."

How do you deal with edit conflicts? E.g., source data is updated around the same time as an expert updates the compound-organism-reference triple through Wikidata. Which edit remains?

"Even if correct at a given time point, scientific advances can invalidate or update previously uploaded data. Thus, the possibility to continuously edit the data is desirable and guarantees data quality and sustainability. Community-maintained knowledge bases such as Wikidata encourage such a process."

On using LOTUS in scientific publications: how can you retrieve the exact version of the Wikidata or LOTUS resource referenced in a specific publication? In my understanding, specific Wikidata versions are not easy to access and are discarded periodically. Please provide an example of how you imagine LOTUS users citing claims provided by a specific version of the LOTUS SSOT to Wikidata.

"The LOTUS initiative provides a framework for rigorous review and incorporation of new records and already presents a valuable overview of the distribution of NP occurrences studied to date."

How can a non-LOTUS contributor dissent with a claim made by LOTUS, and make sure that their dissent is recorded in Wikidata or other available platforms? Some claims are expected to be disputed and resolved over time. Please provide an example of how you imagine dealing with unresolved disputes.

"Community participation is the most efficient means of achieving more comprehensive documentation of NP occurrences, and the comprehensive editing opportunities provided within LOTUS and through the associated Wikidata distribution platform open new opportunities for collaborative engagement."

Do you have any evidence to suggest that community participation is the most effective way to achieve more comprehensive documentation of NP occurrences? Please provide references or data to support this claim.

"In addition to facilitating the introduction of new data, it also provides a forum for critical review of existing data, as well as harmonization and verification of existing NP datasets as they come online."

Please provide an example of cases in which a review led to a documented dispute that allowed two or more conflicting sources to co-exist. Show an example of how to annotate disputed claims and claims with suggested corrections.

How do you imagine using Wikidata to document evidence for a refuted "compound-found-in-organism claimed-by-reference" claim?

"Researchers worldwide uniformly acknowledge the limitations caused by the intrinsic unavailability of essential (raw) data (Bisson et al., 2016)."

Note that FAIR does not mandate open access to data; it only suggests vague guidelines to find, access, integrate, and re-use resources that may or may not be access-controlled.

If you'd like to emphasize open access in addition to FAIR, please separately mention Open Data / Open Access references.

re: "We believe that the LOTUS initiative has the potential to fuel a virtuous cycle of research habits and, as a result, contribute to a better understanding of Life and its chemistry."

LOTUS relies on the (hard) work of the underlying data sources to capture the relationships of chemical compounds with organisms, along with their citation references. However, the data sources are not credited in the Wikidata interface. How does *not* crediting your data sources contribute to a virtuous cycle of research habits?

re: Wikidata activity of NPImporterBot

Associated bot can be found at:

https://www.wikidata.org/wiki/User:NPImporterBot

On 5 Aug 2021, the most recent entries on the page https://www.wikidata.org/wiki/Special:Contributions/NPImporterBot included a single edit from 4 June 2021:

https://www.wikidata.org/w/index.php?title=Q107038883&oldid=1434969377

followed by many edits from 1 March 2021

e.g.,

https://www.wikidata.org/w/index.php?title=Q27139831&oldid=1373345585

with many entries added/upserted in early Dec 2020

with earliest entries from 28 Aug 2020

https://www.wikidata.org/w/index.php?title=User:NPImporterBot&oldid=1267066716

It appears that the bot has been inactive since June 2021. Is this expected? When will the bot become active again?

re: "The Wikidata importer is available at https://gitlab.com/lotus7/lotus-wikidata-importer. This program takes the processed data resulting from the lotusProcessor subprocess as input and uploads it to Wikidata. It performs a SPARQL query to check which objects already exist. If needed, it creates the missing objects. It then updates the content of each object. Finally, it updates the chemical compound page with a "found in taxon" statement complemented with a "stated in" reference."

How does the Wikidata upserter take manual edits into account? Also, if an entry is manually deleted, how does the upserter know the statement was explicitly removed, to make sure not to re-add a potentially erroneous entry?
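For illustration, the existence check quoted above can be thought of as a SPARQL ASK query against the Wikidata endpoint. The sketch below only builds the query string; the function name and the compound QID are hypothetical (P703 "found in taxon" is the standard Wikidata property), and it does not claim to mirror the importer's actual code.

```python
def found_in_taxon_exists_query(compound_qid, taxon_qid):
    """Build a SPARQL ASK query testing whether a compound item already
    carries a 'found in taxon' (P703) statement for a given taxon.
    Illustrative only, not the importer's actual implementation."""
    return "ASK { wd:%s wdt:P703 wd:%s . }" % (compound_qid, taxon_qid)

# Hypothetical compound QID, checked against Arabidopsis thaliana (Q158695):
q = found_in_taxon_exists_query("Q12345", "Q158695")
# The string would be sent to https://query.wikidata.org/sparql;
# an ASK query returns a single boolean.
```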

In attempting to review and reproduce (parts of) the workflow using the provided documentation, https://gitlab.com/lotus7/lotus-processor/-/blob/7484cf6c1505542e493d3c27c33e9beebacfd63a/README.adoc#user-content-pull-the-repository was found to suggest running:

git pull https://gitlab.unige.ch/Adriano.Rutz/opennaturalproductsdb.git

However, this command failed with error:

fatal: not a git repository (or any parent up to mount point /home)

Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).

Assuming that the documentation meant to suggest to run:

git clone https://gitlab.com/lotus7/lotus-processor,

the subsequent command to retrieve all data also failed:

$ dvc pull

WARNING: No file hash info found for 'data/processed'. It won't be created.

WARNING: No file hash info found for 'data/interim'. It won't be created.

WARNING: No file hash info found for 'data/external'. It won't be created.

WARNING: No file hash info found for 'data/validation'. It won't be created.

4 files failed

ERROR: failed to pull data from the cloud – Checkout failed for following targets:

data/processed

data/interim

data/external

data/validation

Is your cache up to date?

Without a reliable way to retrieve the data products, I cannot review your work and verify your claims of reproducibility and versioned data products.

Also, in https://gitlab.com/lotus7/lotus-processor/-/blob/7484cf6c1505542e493d3c27c33e9beebacfd63a/README.adoc#user-content-dataset-list, you claim that the dataset list is captured in xref:docs/dataset.tsv. However, no such file was found. I suspect a broken link, with the correct version pointing to xref:docs/dataset.csv.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "The LOTUS Initiative for Open Knowledge Management in Natural Products Research" for further consideration by eLife. Your revised article has been evaluated by Anna Akhmanova (Senior Editor) and a Reviewing Editor.

The manuscript has been improved but there are some remaining issues (particularly those of Reviewer 2) that need to be addressed, as outlined below:

Reviewer #1 (Recommendations for the authors):

I'm quite happy with the authors' responses to my previous review. Thank you for addressing mine and the other reviewers' points carefully.

Reviewer #2 (Recommendations for the authors):

Rutz et al. describe the LOTUS initiative, a database which contains over 750,000 referenced structure-organism pairs. They present both the data they have made available in Wikidata and their interactive web portal (https://lotus.naturalproducts.net/), which provides a powerful platform for mining the literature for published data on structure-organism pairs.

The clarity and completeness of the manuscript have improved significantly from the first round to the second round of reviews. Specifically, the authors clarified the scope of the study and provided concrete, clear examples of how they think the tool should/could be used to advance NP research.

I think the authors adequately addressed reviewer concerns regarding the documentation of the work and provide ample documentation on how the LOTUS initiative was built, as well as useful documentation on how it can be used by researchers.

Strengths:

The LOTUS initiative is a completely open-source database built on the principles of FAIRness and TRUSTworthiness. Moreover, the authors have laid out their vision of how they hope LOTUS will evolve and grow.

The authors make a significant effort to consider the LOTUS end-user. This is evidenced by the clarity of their manuscript and the completeness of their documentation and tutorials for how to use, cite and add to the LOTUS database.

Weaknesses/Questions:

1) The authors largely addressed my primary concern about the previous version of the manuscript, which was that the scope and completeness of the LOTUS initiative were not clearly defined. Moreover, providing concrete examples of how the resource can be used both clarifies the scope of the project and increases the chances that it will be adopted by the greater scientific community. In the previous round of reviews, I asked "if LOTUS represents a comprehensive database". In their response, the authors raise the important point that the database will always be limited by the data available in the literature. While I agree with the authors on this point and do not think it takes away from the value of the manuscript/initiative, I think a more nuanced discussion of this point is merited in the paper.

The authors say that the example queries provided illustrate that LOTUS can be used to answer research questions, e.g. how many compounds are found in Arabidopsis. However, interpreting the results of such queries is far more nuanced. For example, can one quantitatively compare the number of compounds observed in different species or families based on the database (e.g. do legumes produce more nitrogen-containing compounds than other plant families)? Or do the inherent biases in the literature inhibit our ability to draw such conclusions? I like that the authors showcase the potential value of using the outlined queries as a starting place for a systematic review, as they highlight in the text. However, I would like to see the authors discuss the inherent limitations of using/interpreting the database queries given the incompleteness of NP exploration.

2) On page (16?) the authors discuss how the exploration of NP is rather limited, citing the fact that each structure is found on average in only three organisms, and that only eleven structures are reported per organism. How do we know that these low numbers are a result of incomplete literature or research efforts and not due to the biology of natural products? For example, plant secondary metabolites are extremely diverse, and untargeted metabolomics studies have shown that any given compound is found in only a few species. There is likely no way of knowing the answer to this question, but this section could benefit from a more in-depth discussion.

3) I do not understand Figure 4. I would like to see more details in the figure legend and an explanation of what an "individual" represents. Are these individual publications? Individual structure-organism pairs? Also, what do the box plots represent? Are they connected to the specific query of Arabidopsis thaliana/β-sitosterol? If they are, there should only be a single value returned for the number of structures in Arabidopsis thaliana, rather than a distribution. Thus, I assume that the accumulation curve and the box plots are unrelated, and this should be spelt out in the figure legend or separated into two different panels.

4) On page 5 the authors introduce/discuss COCONUT. I am confused about how exactly COCONUT is related to or different from LOTUS. Can you provide some more context, similar to how the other databases were discussed in the preceding paragraph?

https://doi.org/10.7554/eLife.70780.sa1

Author response

Essential revisions:

1) Reviewers would like to see the authors clarify the scope of their work and address to what extent this database will provide the community with a comprehensive database of structure-organism pairs that is actually usable to answer research questions. This could be achieved by adding concrete examples of how these data could be used, to help readers/reviewers understand the scope.

We thank the editors and reviewers for these suggestions. The scope of our work is dual: LOTUS was designed both to help gather and exploit past NP research output (structure-organism pairs) and to facilitate its future formatting and reuse. LOTUS doesn’t only provide a wide set of curated and documented structure-organism pairs, but also a set of tools to gather, organize, and interrogate them. The results of this first output of the LOTUS initiative can thus be exploited in multiple ways by a wide range of researchers. We agree with the reviewers that mentioning concrete application examples is important.

Prior to this work, efficient access to information on specialized metabolite occurrences was a complicated task. Since we implemented LOTUS through Wikidata, anyone can easily access high-quality information about natural product occurrences in different ways (which organisms contain a given chemical compound? which compounds are found in a given organism?).

For example, with a simple click and without programming skills, anyone can look up the metabolome of Arabidopsis thaliana on Scholia (https://scholia.toolforge.org/taxon/Q158695). This page, and specifically its “Metabolome” section, yields a table with relevant information related to the chemical structures (Wikidata identifiers, SMILES codes, etc.) that can be easily downloaded. Sharing such public data in an open and correctly formatted form is the main scope of this first project of the LOTUS initiative and a first concrete step toward engaging the NP community in such a data-sharing model.

To concretely illustrate the scope of our work, we listed multiple concrete examples of how these data could help answer relevant research questions in Table 2, which we further updated (see below) in our manuscript (https://lotusnprod.github.io/lotus-manuscript/#tbl:queries). This table lists a selected set of SPARQL queries that offer the reader concrete illustrations to exploit the LOTUS data available at Wikidata. Following the reviewer’s suggestions, we adapted the manuscript to advertise this central table better and earlier in the manuscript (See https://github.com/lotusnprod/lotus-manuscript/commit/1456cc82d6cb7a79a57f1403a053f9f0348fb351). Furthermore, readers are invited to have a look at the project page (https://www.wikidata.org/wiki/Wikidata:WikiProject_Chemistry/Natural_products) on Wikidata to fetch the most recent and updated SPARQL queries examples. This project page is also indicated to ask for help or suggest improvements.

To further showcase the possibilities opened by LOTUS, and to answer the remark on the comprehensiveness of our resource, we established two additional queries (https://w.wiki/4VGC and https://w.wiki/4VGC). Both queries were inspired by recent literature reviews (https://doi.org/10.1016/j.micres.2021.126708 and https://doi.org/10.1016/j.phytochem.2021.113011). The first describes compounds found in Actinobacteria, with a focus on compounds with reported bioactivity. The second describes compounds found in Aspergillus spp., with a chemical focus on terpenoids. In both cases, within seconds, the queries retrieve a table similar to the ones in the mentioned literature reviews. Such queries are not meant to fully replace the valuable work behind such review papers; however, they now offer a solid basis for starting such works and free up precious time to analyze and discuss the results.
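As a sketch of what such a query can look like (the exact SPARQL behind the linked short URLs is not reproduced here; P703 "found in taxon" and P248 "stated in" are the standard Wikidata properties), one could retrieve a taxon's reported compounds together with their stating references:

```python
def metabolome_query(taxon_qid):
    """SPARQL query listing compounds reported in a taxon together with
    the reference stating each occurrence. Illustrative sketch, not the
    exact query behind the links above."""
    return """
SELECT ?compound ?compoundLabel ?reference WHERE {
  ?compound p:P703 ?stmt .
  ?stmt ps:P703 wd:%s ;
        prov:wasDerivedFrom/pr:P248 ?reference .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
""" % taxon_qid

# The built string would be sent to https://query.wikidata.org/sparql
query = metabolome_query("Q158695")  # Arabidopsis thaliana
```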

Overall, the LOTUS initiative aims to support the NP community’s information retrieval and to free researchers' valuable time for tasks that cannot be automated.

In the manuscript, we precisely outlined our vision of the overall scope of the LOTUS initiative both with concrete application cases and more generic objectives (see https://github.com/lotusnprod/lotus-manuscript/commit/67c4fbdae130b737dfc6acc3da0b6099d5ac2fb9).

With this, we hope that the scope of this first project of the LOTUS Initiative has been clarified and properly exemplified and we remain at your disposal for any further illustrations.

2) Reviewers think the manuscript provides missing or incomplete documentation on the LOTUS processes and software, especially in terms of reproducibility of results. Providing open access to source code and data is an important first step, and additional (challenging!) steps are needed to help others independently build, and use, the provided software/data.

The reviewers were right: at the initial submission stage, multiple parts of the LOTUS pipeline were not fully accessible and/or poorly documented.

Taking this into account, we adapted and reformatted multiple parts of our different repositories.

First, we kept DVC (a data management tool – https://dvc.org) for internal use only and added an option to programmatically access easily fetchable external data sources, so that anyone can access the input data for workflow reproducibility and testing purposes.

Second, since the full reproduction of the entire LOTUS processing workflow is time-consuming (> days), we added a minimal working example (https://github.com/lotusnprod/lotus-processor/blob/main/tests/tests_min.tsv) sampled from the original data to illustrate how the process works and to allow anyone to reproduce results from this subset.

Third, as suggested, we moved all repositories to a single place, on GitHub (https://github.com/lotusnprod). The code repositories now contain improved technical documentation, important steps that are periodically built by continuous integration, and additional features to improve the user experience (see below). While previously focused on Linux systems only, we extended the portability of our workflow to macOS and to Windows under WSL (https://github.com/lotusnprod/lotus-processor/actions/runs/1450330111). This should help reach a broader audience.

Moreover, a “manual” mode has been implemented so that each researcher can upload their own documented structure-organism pairs to Wikidata. See the instructions at https://github.com/lotusnprod/lotus-processor/wiki/Processing-your-own-documented-structure-organism-pairs.

As we built our tool for the NP community, additional requests will likely appear once more users start using it. The GitHub issue tracker collects users’ problems and suggestions in a single place. We cannot guarantee the absence of hiccups, but we assure the editors that, following the reviewers’ comments, we did our best to achieve optimal reproducibility. Again, we are happy to improve and remain open to additional suggestions.

3) The authors have yet to provide compelling evidence of how their continuously managed (and updated) resources can be reliably cited in scholarly literature and incorporated in scientific workflows. In order to study and reference Wikidata, a versioned copy needs to be provided: Wikidata is updated constantly, and these constant stream of changes make it hard for others to verify results extracted from some older version of the Wikidata corpus unless a versioned copy is provided.

This is indeed a very pertinent point. Since Wikidata is a dynamic environment, versioning poses some challenges. However, versioning and tracking of the data dynamics can be achieved in multiple ways and at different levels:

Wikidata level:

The first way to access LOTUS data at a given time point is to use the Wikidata images. Indeed, regular dumps (versions) of Wikidata exist and are available at https://dumps.wikimedia.org/wikidatawiki/entities. Querying them is admittedly challenging and not the most convenient, a direct consequence of the amount of information they hold. However, these regular snapshots include the totality of the LOTUS data present on Wikidata. They are produced independently of the LOTUS initiative members and thus offer increased durability with respect to versioning.
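As a minimal sketch (not the LOTUS tooling itself), the Wikidata JSON dumps, which are a JSON array with one entity per line, can be subset without loading them whole by streaming the lines and keeping only items carrying the "found in taxon" property (P703) used by LOTUS; the helper name is hypothetical:

```python
import json

def lotus_item_ids(lines):
    """Yield the QIDs of dump entities that carry a P703 (found in taxon) claim."""
    for line in lines:
        line = line.strip().rstrip(",")  # each entity line ends with a comma
        if line in ("[", "]", ""):
            continue  # skip the array brackets wrapping the dump
        entity = json.loads(line)
        if "P703" in entity.get("claims", {}):
            yield entity["id"]
```

On a real dump, `lines` would come from e.g. `bz2.open("wikidata-YYYYMMDD-all.json.bz2", "rt")`; the streaming approach keeps memory use constant regardless of dump size.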

LOTUS level:

A way to get around the problems posed by the amount of data in the versioning strategy detailed above (full Wikidata dumps) is to subset Wikidata for the entries of interest only: in our case, the structures, biological organisms, bibliographic references, and the links among them. Interested researchers can do this using the https://github.com/lotusnprod/lotus-wikidata-interact bot.

Using this bot (or future versions of it), we will also regularly upload the results of a global SPARQL query to the LOTUS Initiative Zenodo community repository (https://zenodo.org/communities/the-lotus-initiative); see, for example, https://zenodo.org/record/5668855.

SPARQL level:

When using Wikidata as a source for LOTUS information, we invite users to share their query using its short URL, to archive its results on a data repository such as Zenodo or OSF, and to share both when releasing associated results. This will greatly help the move toward more reproducible research outputs.

For example, the output of the Wikidata SPARQL query https://w.wiki/4N8G, run on 2021-11-10 at 16:56, can easily be archived and shared in a publication: https://zenodo.org/record/5668380.

Interested users can directly upload the results of their SPARQL queries to our LOTUS Initiative community repository using the following link: https://zenodo.org/deposit/new?c=the-lotus-initiative.
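To illustrate the kind of query whose short URL and results we suggest archiving, here is a hedged sketch. The properties are real Wikidata ones (P235 = InChIKey, P703 = found in taxon, P248 = stated in, P356 = DOI), but the exact query shape is an assumption for illustration, not the archived https://w.wiki/4N8G query:

```python
from urllib.parse import urlencode

# Structure-organism-reference triples: each taxon statement is followed
# through its reference (prov:wasDerivedFrom) to the citing work's DOI.
QUERY = """SELECT ?structure ?inchikey ?taxon ?doi WHERE {
  ?structure wdt:P235 ?inchikey ;
             p:P703 ?stmt .
  ?stmt ps:P703 ?taxon ;
        prov:wasDerivedFrom/pr:P248/wdt:P356 ?doi .
} LIMIT 10"""

def endpoint_url(query):
    # This URL (or its w.wiki short form), together with a dated dump of
    # the results on Zenodo, makes the query citable and reproducible.
    return "https://query.wikidata.org/sparql?" + urlencode(
        {"query": query, "format": "json"})
```

Fetching `endpoint_url(QUERY)` returns the result set as JSON, which can then be deposited as-is on Zenodo alongside the query text.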

Individual entry level:

At a more precise scale, each Wikidata entry has a permanent history. Versioning is therefore done automatically for each entry of the LOTUS data and can be accessed by anyone. See, for example, the full history of erythromycin: https://www.wikidata.org/w/index.php?title=Q213511&action=history.

To summarize, given its Wikidata-based hosting, LOTUS data is expected to evolve. While versioning is not impossible, it requires different means than those used for more classical, static databases. Note that each researcher's queries can easily be versioned and shared with the community, and that the individual history of each of LOTUS's central objects (chemical structures, biological organisms, and references) is kept fully accessible on the history page of each Wikidata item. Altogether, we hope that the means proposed above provide good ways of tracking and versioning LOTUS entries.

Propositions for versioning mechanisms were added to the manuscript; see https://github.com/lotusnprod/lotus-manuscript/commit/92833166375aa9cc29f44cffdd5ad693fa42934c

Reviewer #1 (Recommendations for the authors):

“A third fundamental element of a structure-organism pair is a reference to the experimental evidence that establishes the linkages between a chemical structure and a biologicl organism and a future-oriented electronic NP resource should contain only fully-referenced structure-organism pairs.”

Typo in "biologicl"

We thank the reviewer for carefully reading, and we corrected the typo.

“Currently, no open, cross-kingdom, comprehensive, computer-interpretable electronic NP resource links NP and their producing organisms, along with referral to the underlying experimental work”.

missing "that"

We couldn’t find the missing “that” in this sentence.

"KNApSAck currently contains 50,000+ structures and 100,000+ structure-organism pairs. However, the organism field is not standardized and access to the data is not straightforward".

This is the first opportunity in the manuscript to describe in more detail the perils of previous databases, especially the mess that is KNApSAck, which you have no choice but to work on because of its ubiquity.

As answered to Reviewer 1, question N° 3, it was a deliberate choice not to emphasize the perils/weaknesses of the currently available DBs, which we ourselves use, but to focus on better habits for the future. We did not want to devalue some resources or elevate others at their expense.

“NAPRALERT is not an open platform, employing an access model that provides only limited free searches of the dataset”.

There's an awful lot of praise for this database given that it is directly antithetical to the manuscript. Please provide further commentary contrasting the work in NAPRALERT (particularly, about its shortcomings as a closed resource) to LOTUS.

NAPRALERT maintenance and information input were greatly reduced in the last decade, and as such its limitations are showing ever more. It nevertheless proved to be a valuable resource for LOTUS. Its current maintainers, who all contributed to LOTUS and this manuscript, decided to donate all of its chemistry-related data to the project so it could be made available to all. The quality of the chemical annotations in NAPRALERT is rather poor, as it holds no direct structures or structural information but mainly chemical names, requiring extensive use of chemical translators (a rather error-prone process; see the answer to question 9). However, its bibliographical data is of much higher quality than in other resources, as is its organism data, even though many of the names have changed with recent taxonomic revisions. Overall, and as answered in the previous question, our choice in this manuscript was not to compare or list the shortcomings of existing data sources but rather to propose new directions for the future.

“FAIR and TRUST (Transparency, Responsibility, User focus, Sustainability and Technology) principles”.

If you really want to go down the buzzword bingo, I'd suggest making a table or going in depth into each point. These acronyms are, in my opinion, effectively meaningless from a technical perspective, so it falls on the authors of the paper who want to use them (as you are no doubt pressured to do in the modern publishing landscape) to define them and qualify their relevance to your more practical goals.

Whether these acronyms are meaningless buzzwords or valuable and ideal guidelines to be met is indeed, and happily, the responsibility and liberty of each researcher to decide. As long as these terms are considered buzzwords rather than guidelines, there will be work to do for the advancement of open research. After deliberation among the authors, we believe that because LOTUS checks both boxes of FAIR (https://en.wikipedia.org/wiki/FAIR_data) and TRUST (https://www.nature.com/articles/s41597-020-0486-7), we would rather stick to these two terms. As stated in the TRUST white paper, “However, to make data FAIR whilst preserving them over time requires trustworthy digital repositories (TDRs) with sustainable governance and organizational frameworks, reliable infrastructure, and comprehensive policies supporting community-agreed practices.” We strongly believe that Wikidata is such a repository. Regarding both terms, we felt that redefining alternative ones would not help the reader/user.

“…any researcher to contribute, edit and reuse the data with a clear and open CC0 license (Creative Commons 0).”

This is a really interesting point considering you have taken information from several unlicensed databases and several that have more permissive licenses. How do you justify this?

We thank the reviewer for this comment, it is indeed central to the work. These are our justifications:

– First, we never strictly copied a DB. We carefully took relevant pieces of information, as allowed under the right to quote.

– Second, there was significant work of harmonization and curation including manual validation steps. This alone should justify the possibility of dissemination under another license.

– Finally, we contacted all authors of DBs whose licensing status was unclear in our view, showing goodwill, and will retract any data on request.

In the end, who could, in any ethical consistency, legally own the fact that a given natural product is present in a given organism?

“The SSOT approach consists of a PostgreSQL DB that structures links and data schemes such that every data element has a single place”.

Why is Wikidata not the single source of truth? This means that the curation and generation of this dataset can never really be decentralized- it will always have to be maintained by someone who is the maintainer of the PostgreSQL database. What are the pros/cons for this?

We thank the reviewer for bringing up this crucial point. Here, there are multiple elements to take into consideration:

First, Wikidata does not currently allow sharing some properties of interest to the NP community (e.g. structural depictions, spectra, etc.). Having them in a specialized DB can help for the moment. If the community is happy with the offered resource, it is much more likely to contribute to the project and to push towards better, accepted, Wikidata-compatible standards.

Second, sadly, intentional or unintentional vandalism exists on Wikidata. Having a decoupled PostgreSQL DB can therefore be seen as additional security.

Of course, this makes the maintainability of the minimal “non-Wikidata” part of LOTUS as weak as the currently criticized system, but as the very large part is already on Wikidata, we consider it a mere backup system. We further consider it a necessary transition step for the community to adhere to our initiative.

Now that the initial release of the LOTUS dataset is on Wikidata, it is indeed decentralized, as data additions/corrections/deletions can come from any Wikidata user. We will be happy to retire the non-Wikidata part of LOTUS as soon as it no longer serves any purpose.

“The LOTUS processing pipeline is tailored to efficiently include and diffuse novel or curated data directly from new sources or at the Wikidata level.”

This sentence is confusing

We thank the reviewer and correct the sentence as follows: “The LOTUS processing pipeline is tailored to efficiently include and diffuse novel curated data from new sources. These sources can be new open external resources or additions made directly to Wikidata.”

“All stages of the workflow are described on the git sites of the LOTUS initiative at https://gitlab.com/lotus7 and https://github.com/mSorok/LOTUSweb”

Later this became a problem for code review. Why is the code all in different places? There are "organizations" both on GitLab and GitHub to keep related code together. Further, putting code in a personal user's namespace makes it difficult for potential community involvement, especially if the user becomes inactive. Also a matter of opinion, but most science is being done on GitHub. Using GitLab will likely limit the number of people who will interact with the repository. Highly suggested to move it to GitHub.

This is a good remark. The code was initially split because of the repartition of work among the team. We then tried to centralize everything on GitLab, which had our favor at the time. Taking the reviewer's considerations into account, we moved all repositories to GitHub under the lotusnprod organization (https://github.com/lotusnprod). As we want users to interact with LOTUS as much as possible, we will be happy if, as the reviewer suggests, this increases the number of interactions.

“All necessary scripts for data gathering and harmonization can be found in the lotus-processor repository in the src/1_gathering directory."

The reference to "SI 1 Data Sources List" does not include the actual license information in the table, but rather links to READMEs (which may not stay stable for the life of the manuscript). This should explicitly state which license each database has. Further, it seems a bit disingenuous since some of these links point to README pages that state there is no license given, such as for Biofacquim (and several others)

https://gitlab.com/lotus7/lotus-processor/-/blob/main/docs/licenses/biofacquim.md

How can you justify taking this content and redistributing it under a more permissive license?

We thank the reviewer for this comment and updated the table accordingly. Gathering all this information is challenging. We are sorry for the questionable format of the table but strongly disagree with “disingenuous”: all searches were made ingenuously. The Biofacquim example is a really good one, since multiple versions were published over the course of this manuscript's preparation. As mentioned in the license.md file (“But the [article](https://www.mdpi.com/2218-273X/9/1/31) is under [Creative Commons Attribution License](https://creativecommons.org/licenses/by/4.0/)”), MDPI clearly states:

“No special permission is required to reuse all or part of article published by MDPI, including figures and tables. For articles published under an open access Creative Common CC BY license, any part of the article may be reused without permission provided that the original article is clearly cited. Reuse of an article does not imply endorsement by the authors or MDPI.“

We cited the original article clearly, and we do not see any restriction issue since the distributed content is not copied from Biofacquim but consists of parts of it that have undergone significant processing steps.

Further, this table should contain versioning information for each database and a flag as to whether it is still being maintained, whether data can be accessed as a dump, and if there is a dump, if it is structured. Right now, the retrieval column is not sufficient for describing how the data is actually procured.

We added the requested flags. The table (https://github.com/lotusnprod/lotus-processor/blob/main/docs/dataset.csv) now contains a boolean “dump” column, a column defining the status of the resource (maintained, unmaintained, retired or defunct), and a column with the timestamp of the last input modification. As the majority of the resources do not provide versioning, the timestamp corresponds to when the query was last performed manually. The input might of course have changed in the meantime; however, the git history of this table will allow tracking these evolutions.

Can you cross-reference these databases to Wikidata pages? FAIRSharing pages? I'm sure following publication in the near- or mid-term, more of these databases will go down permanently, so there should be as much information about what they were available with this publication.

After deliberation, we decided to link the structure-organism pairs on Wikidata pages only to the original experimental work documenting the occurrence (the scientific publication). The goal of this LOTUS initiative project was to move away from the classical natural products database system. Just as we do not refer to LOTUS for a documented pair, we do not refer to any other DB (what is the scientific interest of listing the x DBs documenting quercetin in Quercus sp.?) but simply to the original experimental work.

Another interesting point raised is the fact that resources might go down, permanently or not. Since the beginning of our initiative, many of them are no longer openly accessible (wakankensaku, mitishamba), or their licensing status has changed (alkamid, tcmid translation files). This is an additional reason to refer directly to the original work.

Small suggestion: right align all numbers in tables.

We appreciate the suggestion and right-aligned the numbers in the tables.

“This process allowed us to establish rules for automatic filtering and validation of the entries. The filtering was then applied to all entries.”

Please explain this process, in detail (and also link to the exact code that does it as you have done in other sections)

The process is described in detail in the Methods section. We added a link to the exact code and a sentence connecting the results to the corresponding method.

“Table 1: Example of a referenced structure-organism pair before and after curation”.

Would like some discussion of the importance of anatomical region in addition to organism. Obviously, there is importance in the production of the NP in a given region. This is highly granular and makes the problem more difficult, but even if you don't tackle it, it has to be mentioned.

The discussion opened here is of great interest, but probably out of scope for this paper. We aim to establish the basis for better reporting, not to solve the problem down to the finest granularity. This could lead to endless discussions, not only about organs, but would also open the door to many other topics. Saying “quercetin found in Quercus sp.” is in any case a proxy: it was not found in all Quercus sp. but in a specific extract of a specific organism (if lucky, with a voucher ID), growing in a specific location, with specific endophytes (probably responsible for the production of many compounds; for an estimation, see 10.1186/s40793-021-00375-0), under a specific climate, etc. This also explains why we deliberately chose the found in taxon property (https://www.wikidata.org/wiki/Property:P703) and not the natural product of taxon one (https://www.wikidata.org/wiki/Property:P1582). While those aspects certainly are very interesting challenges for the coming years, we cannot address such precision levels in the current work.

Further, why is the organism not standardized to a database identifier? Why are the prefixes to go along with these ID spaces not mentioned (InChI-key for structure, DOI for reference, and what for organism?) Maybe one of the confounders is that there are many taxonomical databases for organisms with varying coverage – this should be addressed. On second pass through the review, I noticed this was described a bit more in the methods section, so this should be linked. Further, it's actually not clear even by the end of the methods section what identifiers are used in the end. Are Wikidata Q identifiers the single unifying identifier in the end?

Indeed, taxa are more complex to handle than InChIKeys and DOIs. As mentioned, all taxon names are linked to (1) a taxonomic database and (2) the ID corresponding to this DB. Because of the known disagreements between taxonomic DBs, this pair constitutes the “ID”: the same taxon name could correspond to different taxa in different DBs. As Wikidata still uses taxon names as IDs to link them to taxonomic DBs, we did the same and uploaded each identifier we found through GNames. So, in the end, the three identifiers used to decide whether to create or update a QID are the InChIKey, the DOI, and the taxon name (accepted in a taxonomic DB recognized by the Wikidata taxonomy community). As the reviewer states at the end of their comment, the Wikidata ID of a taxon name can indeed be seen as an efficient means of identifying a biological organism, as Wikidata in this case acts as a taxonomic aggregator and moderates disagreements between taxonomists (who often disagree…). See, for example, the taxa graph on the Scholia page of Lotus halophilus: https://scholia.toolforge.org/taxon/Q15435646.
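The matching logic described above can be sketched as follows; the helper name and the normalization details are hypothetical, not the actual bot code. An entry is identified by its InChIKey, its DOI, and its taxon name qualified by the taxonomic database in which that name is accepted:

```python
def pair_key(inchikey, doi, taxon_name, taxon_db):
    """Composite identity of a referenced structure-organism pair."""
    return (
        inchikey.strip().upper(),
        doi.strip().lower(),                      # DOIs are case-insensitive
        (taxon_db.strip().lower(), taxon_name.strip()),  # name only meaningful per DB
    )
```

Two records that differ only in whitespace or letter case of the DOI then collapse to the same key, while the same taxon name under two different taxonomic databases stays distinct.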

“Figure 2: Alluvial plot of the data transformation flow within LOTUS during the automated curation and validation processes.”

Looks like more things are rejected than kept. What about having a human curator in the loop? Or should poorly curated stuff be thrown way forever? There was a small mention about building a human curation system, but I don't think this had the same focus.

Indeed, we favored quality over quantity, as highlighted by the F-scores obtained after automated curation. As mentioned, human curation will take place later on; entries are not thrown away forever but are not of sufficient quality to be sent as-is. We adapted the text accordingly.

A curation interface is actually in development, with access to part of the SSOT; technical details for access, logging, etc. are still under discussion. This curation interface will, as mentioned, allow us to upload many entries that we have not uploaded for the moment. Following the reviewer's question, we added the non-uploaded part of our processed entries to Zenodo under the following link: 10.5281/zenodo.5794596.

“The figure highlights, for example, the essential contribution of the DOI category of references contained in NAPRALERT towards the current set of validated references in LOTUS”.

Is this to say LOTUS is constructed with the closed data in NAPRALERT? Or the small portion that was made open was sufficient?

As previously mentioned, part of the closed data of NAPRALERT was donated to LOTUS. This part consists only of the minimally needed triplet: structure-organism-reference. The sentence was written to reinforce the fact that identifiers, such as DOIs, lead to better results than other types of data. We removed the sentence concerning NAPRALERT and adapted the text.

“…any researcher has a means of providing new data for LOTUS, keeping in mind the inevitable delay between data addition and subsequent inclusion into LOTUS.”

How likely do you think users will be to contribute their data to LOTUS, when there are many competing interests, like publishing their own database on their own website, which will support their own publication (and then likely go into obscurity)?

Ultimately, this means there should be a long-term commitment from the LOTUS initiative to continue to find new databases as they're published, to do outreach to authors (if possible/necessary), and to incorporate the new databases into the LOTUS pipeline as a third party. This should be described in detail – which organizations will do this/how will they fund it?

We thank the reviewer for pointing this out. The sentence was indeed poorly formulated. What the authors wanted to express was “new data for the community, via Wikidata (for LOTUS)”. As written in the text, the NP community suffers from sub-optimal data-sharing habits, and the authors would like to participate in positive changes in this respect. Without the help of publishers it might take some time, but spectral repositories clearly show that it is feasible. We are convinced that users and maintainers will adopt this change, as it also clearly lowers the funding needed.

“For this, biological organisms were grouped into four artificial taxonomic levels (plants, fungi, animals and bacteria).”

How? Be more specific about methodology including code/dependencies

An additional sentence linking to the related method has been added. More details are also given in the main text.

“Figure 6: TMAP visualizations of the chemical diversity present in LOTUS. Each dot”.

This color scheme is hard to interpret. Please consider using figure 6 in https://www.nature.com/articles/s41467-020-19160-7 to help pick a better color scheme.

We thank the reviewer for this suggestion. We tried our best with the color palettes on all figures, some of them being challenging. While mainly aware of categorical color-blind-friendly palettes, we did not know the batlow palette. We implemented it and revised the coloring of all TMAPs.

“The biological specificity score at a given taxonomic level for a given chemical class is calculated as the number of structure-organism pairs within the taxon where the chemical class occurs the most, divided by the total number of pairs.”

Have you considered the Kullback-Leibler divergence (https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) which is often used in text mining to compare the likelihood of a query in a given subset of a full corpus? I think this would also be appropriate for this setting.

We thank the reviewer very much for this suggestion. After looking at the Kullback-Leibler divergence and its limitations, we found it more appropriate to use the Jensen-Shannon divergence, as it is both symmetric and bounded. We compared this divergence to our initial heuristic equation and chose to implement it.
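For illustration, a minimal sketch of the base-2 Jensen-Shannon divergence (symmetric and bounded in [0, 1]) between two discrete distributions, e.g. the chemical class distribution within a taxon versus the overall corpus. This is a textbook formulation, not the exact LOTUS implementation:

```python
from math import log2

def jensen_shannon(p, q):
    """JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m the midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; 0 * log(0/x) is taken as 0.
        return sum(ai * log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions give 0, fully disjoint ones give 1, and swapping the arguments leaves the value unchanged, which is precisely why it was preferred over the asymmetric and unbounded Kullback-Leibler divergence.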

“This specificity score was calculated as in Equation 2:”

This is a very similar look to the classic Jaccard Index (https://en.wikipedia.org/wiki/Jaccard_index; which is skewed by large differences in the size of the two sets) and the overlap coefficient (https://en.wikipedia.org/wiki/Overlap_coefficient). I think the multiplication of the sizes of the sets will have the same issue as Jaccard index. What is the justification for using this non-standard formulation? Please comment if the conclusions change from using the overlap coefficient.

As for the previous equation, we thank the reviewer for pointing this out. Again, our equation was purely heuristic and its initial form was:

$$\frac{|\text{Structures in chemical class} \cap \text{Structures in taxon}|}{|\text{Structures in chemical class}|}\;(\text{chemical part}) \times \frac{|\text{Structures in chemical class} \cap \text{Structures in taxon}|}{|\text{Structures in taxon}|}\;(\text{taxon part})$$

We did not realize it was so close to a Jaccard Index, so we did compute both the Jaccard index and the overlap coefficient.

Author response image 1
Left: heuristic score; right: overlap index.

While the Jaccard index is a good alternative to our initial score, with an interesting increase for small values, the overlap index flattens the differences too much in our view, as shown in Author response image 1. This is due to the min() in its denominator, given that some classes are very small. For all three investigated metrics, even if the heights of the bars change, the color pattern on the biological tree remains almost unchanged.
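The three candidate metrics discussed above can be written as simple set operations (a sketch; `cls` and `taxon` stand for the structure sets of a chemical class and a taxon, and the function names are ours):

```python
def heuristic(cls, taxon):
    """Product of the chemical part and the taxon part."""
    inter = len(cls & taxon)
    return (inter / len(cls)) * (inter / len(taxon))

def jaccard(cls, taxon):
    """Intersection over union."""
    return len(cls & taxon) / len(cls | taxon)

def overlap(cls, taxon):
    """Intersection over the smaller set; min() causes the flattening."""
    return len(cls & taxon) / min(len(cls), len(taxon))
```

With a small class fully contained in a large taxon, the overlap coefficient saturates at 1.0 while the other two metrics keep discriminating, which illustrates the flattening effect noted above.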

Since this is supporting Figure 7, a new way to explore the chemical and biological diversity, and you previously wrote the huge skew in organisms and taxa over which information is available, it would be important to comment on the possible bias in this figure. Were you trying to make the point that the NPClassifier system helps reduce this bias? If so, it was unclear.

We thank the reviewer for pointing out this lack of clarity. Our goal was not to make the point that NPClassifier helps to reduce this bias. We favored it over ClassyFire simply because it was tailored to describe NPs and thus has the most relevant classes for describing the chemistry of Life. We adapted the text.

Despite the mentioned limitations concerning the studied organisms, LOTUS still appears to be the most comprehensive resource for displaying such chemical and biological diversity.

“As LOTUS follows the guidelines of FAIRness and TRUSTworthiness, all researchers across disciplines can benefit from this opportunity, whether the interest is in ecology and evolution, chemical ecology, drug discovery, biosynthesis pathway elucidation, chemotaxonomy, or other research fields that connect with NP.”

This was emphatically not qualified within the manuscript. The expansions for FAIR and TRUST were given, but with no explanation for what they are. Given that these are buzzwords that have very little meaning, it would either be necessary to re-motivate their importance in the introduction when you mention them, then refer to them throughout, or to skip them. That being said, I see the need for authors to use buzzwords like this to interest a large audience (who themselves do not necessarily understand the purpose of the concepts).

We addressed this point in a previous remark, it should be clearer now.

“This project paves the way for the establishment of an open and expandable electronic NP resource.”

By the end of the conclusion, there was no discussion of some of the drawbacks and roadblocks that might get in the way of the success of your initiative (or broadly how you would define success). I mentioned a few times already in this review places where there were meaningful discussions that were overlooked. This would be a good place to reinvestigate those.

Thanks for the suggestion; we updated the conclusion of the manuscript with a shortcomings-and-challenges section. See https://github.com/lotusnprod/lotus-manuscript/commit/a866a01bad10dfd8b3af90e2f30bb3ae51dd7b9e.

Methods

I'm a huge opponent to the methods section coming after the results. There were a few places where the cross-linking worked well, but when reading the manuscript from top to bottom, I had many questions that made it difficult to proceed. The alternative might not be so good either, so I understand that this likely can't be addressed.

Thank you for your suggestions here. We thought about the remodeling implied by shifting the Methods section and decided to keep the document structure as is. We understand it is not optimal. We have, however, added cross-links between the Results and Methods sections.

“Before their inclusion, the overall quality of the source was manually assessed to estimate, both, the quality of referenced structure-organism pairs and the lack of ambiguities in the links between data and references.”

How was quality assessed? Do you have a flowchart describing the process?

This question is highly interesting. The quality assessment relies mainly on human experience: basically, some random documented pairs were chosen and their plausibility evaluated. Even if some projects (https://doi.org/10.33774/chemrxiv-2021-gxjgc-v2) aim to automatically link biological sources to chemical structures, this is currently difficult to automate.

For trained NP chemists, it is relatively easy to quickly discard DBs reporting aberrant structures in some organisms. Intentionally, we have not focused on the rejected sources, in order not to "point the finger" at them too much, as this is not the primary aim of our study. Regarding the second part, some DBs simply link all pairs to all references, such that it is almost impossible to find the reference documenting a given pair. This also needs human evaluation before computational rules can be established.

“Traditional Chinese Medicine (TCM) names, and conversion between digital reference identifiers, no solution existed, thus requiring the use of tailored dictionaries.”

How were these made? Where are the artifacts? Is it the same as the artifacts linked in the later section? Please cross link to them if so.

Yes, they are the ones mentioned later in the methods. We added a linking sentence.

“All pre-translation steps are included in the preparing_name function and are available in src/r/preparing_name.R.”

For this whole section: why were no NER methods applied?

This section and others would greatly benefit from NER. However, NER would have to be applied to the original publications, not to already preformatted fields, which makes NER less powerful here. We believe some work has to be done on the original articles, but this implies other restrictions (including legal access). Projects such as DECIMER (https://doi.org/10.1186/s13321-020-00469-w) will greatly help in this direction.

“When stereochemistry was not fully defined, (+) and (-) symbols were removed from names.”

Given the domain, this seems like a huge loss of information. Can you reflect on what the potential negative impacts of this will be? This also comes back to my other suggestion to better motivate the downstream work that uses databases like LOTUS – how could it impact one of those specific examples?

This is actually not a loss of information, but simply a curation step applied to the structure name field, carried out to avoid inventing information that does not exist. Perhaps the sentence was not clear: sometimes, with an isomeric SMILES, the generated name contained (+)/(-), which was not correct. To be on the safe side and not attribute a random stereochemistry, we removed these symbols.

Many structures reported with fully defined stereochemistry in the original articles were unfortunately degraded when imported into the source DBs of this work. This is an additional reason why we want to push towards direct reporting of the pairs from the article to Wikidata. As a result, many of the structures we added lack stereochemistry information even though it may exist, but we opted for the safe solution. In our view, attempting the opposite would have had a negative impact.

Having non-fully defined structures is already of great interest to many communities, such as the mass spectrometry community, where, for example, the dereplication of structures can be carried out without stereochemistry.

“After full resolution of canonical names, all obtained names were submitted to rotl (Michonneau et al., 2016)”.

missing reference

We thank the reviewer for the careful reading; we added the correct reference.

Reviewer #2 (Recommendations for the authors):

Overall well written with few errors.

page 14:

"Targeted queries allowing to interrogate LOTUS data from the perspective of one of the three objects forming the referenced structure-organism pairs can be also built."

Should be "can also be built"

We thank the reviewer for the careful reading and have corrected the sentence.

Some of the figures after page 68 appear to be corrupted.

We noticed this and contacted the eLife editorial office about it. It might be an internal error only, as all figures display well at https://lotusnprod.github.io/lotus-manuscript.

Reviewer #3 (Recommendations for the authors):

[…]

re: "LOTUS employs a Single Source of Truth (SSOT, Single_source_of_truth) to ensure data reliability and continuous availability of the latest curated version of LOTUS data in both Wikidata and LNPN (Figure 1, stage 4). The SSOT approach consists of a PostgreSQL DB that structures links and data schemes such that every data element has a single place."

Single source of truth is advocated by authors, but not explicitly referenced in statements upserted into Wikidata. How can you trace the origin of LOTUS Wikidata statements to a specific version of the SSOT LOTUS postgres db?

Concerning the Wikidata statements, we deliberately avoid any attribution to source DBs (LOTUS or any other). The reference documenting a structure-organism pair must be the original publication, without intermediates. While it is true that, at present, a lot of information is retrieved through specialized databases, we argue this practice should stop in favor of what we are suggesting: direct access to the original work on Wikidata. If, in the future, data regarding structure-organism pairs are directly edited/uploaded to Wikidata, this whole issue will be solved. Regarding the SSOT, it is versioned internally, and we match each Wikidata statement through its three identifiers: organism name, InChIKey, and DOI.
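
A minimal sketch of how such triple-based matching could work (the function name, data layout, and the placeholder InChIKey are illustrative, not the actual SSOT schema):

```python
# Hypothetical sketch: each Wikidata statement is traced back to an SSOT row
# through its three identifiers (organism name, InChIKey, DOI).
# Function names and the data layout are illustrative, not the real schema.

def statement_key(organism, inchikey, doi):
    """Normalize the three identifiers into a single lookup key."""
    return (organism.strip().lower(), inchikey.strip().upper(), doi.strip().lower())

# Toy SSOT rows (the InChIKey is a placeholder, not a real compound)
ssot_rows = [
    {"organism": "Clerodendrum chinense",
     "inchikey": "AAAAAAAAAAAAAA-BBBBBBBBBB-N",
     "doi": "10.1000/example"},
]
index = {statement_key(r["organism"], r["inchikey"], r["doi"]): r for r in ssot_rows}

# A Wikidata statement carrying the same identifiers resolves to the same row,
# regardless of case or stray whitespace
match = index.get(statement_key(" clerodendrum chinense ",
                                "aaaaaaaaaaaaaa-bbbbbbbbbb-n",
                                "10.1000/EXAMPLE"))
```

The point of normalizing before keying is that the same logical triple can appear with cosmetic differences on either side.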

re: Figure 1: Blueprint of the LOTUS initiative.

Data cleaning is a subjective term: what may be "clean" data to some may be considered incomplete or incorrectly transformed by others. I suggest using a less subjective term such as "process", "filter", "transform", or "translate".

We thank the reviewer for the suggestion. We adopted a more neutral wording. We adapted the figure, the text and the repository architecture, replacing all “clean-” with “process-”.

re: "The contacts of the electronic NP resources not explicitly licensed as open were individually reached for permission to access and reuse data. A detailed list of data sources and related information is available as SI-1."

The provided dataset references provide no mechanism for verifying that the referenced data was in fact the data used in the LOTUS initiative. A method is provided for versioning the various data products across different stages (e.g., external, interim, processed validation), but README was insufficient to access these data (for example see below).

While we agree such a mechanism would be helpful, the first part of the remark applies to all published science, not only to our initiative. As previously explained, we do not want to redistribute original content. We made all our distributable data accessible, along with the totality of our processing pipeline, including the parts handling data we cannot redistribute.

We improved the documentation and data access in order to facilitate reproducibility of all steps that can be reproduced.

re: "All necessary scripts for data gathering and harmonization can be found in the lotus-processor repository in the src/1_gathering directory. All subsequent and future iterations that include additional data sources, either updated information from the same data sources or new data, will involve a comparison of the new with previously gathered data at the SSOT level to ensure that the data is only curated once."

re: "These tests allow a continuous revalidation of any change made to the code, ensuring that corrected errors will not reappear."

How do you continuously validate the code? Periodically, or only when changes occur? In my experience, claiming continuous validation and testing is supported by active continuous integration / automated testing loops. However, I was unable to find links to continuous testing logs on platforms like travis-ci.org, github actions, or similar.

Code is continuously validated each time changes are pushed to the repository, through CI/CD. This was previously implemented in GitLab (https://gitlab.com/lotus7/lotus-processor/-/pipelines). Since we moved the repository to GitHub as requested by reviewer 1, we made a new CI pipeline available at: https://github.com/lotusnprod/lotus-processor/actions. The pipeline goes through each step of the processing with a minimal test example. As detailed previously, we also implemented new features, such as the "custom" mode (allowing users to provide their own list of documented structure-organism pairs), which is also tested through the CI.

re: "LOTUS uses Wikidata as a repository for referenced structure-organism pairs, as this allows documented research data to be integrated with a large, pre-existing and extensible body of chemical and biological knowledge. The dynamic nature of Wikidata fosters the continuous curation of deposited data through the user community. Independence from individual and institutional funding represents another major advantage of Wikidata."

How is Wikidata independent from individual and institutional funding? Assuming that it takes major funding to keep Wikidata up and running, what other funding source is used by Wikidata beyond individual donations and institutional grants/support?

We thank the reviewer for pointing this out, our sentence was indeed unclear. We adapted it in the manuscript.

What we wanted to express is the precarity of many academic tools, which often rely on a single funding source, or on a single PhD student or postdoctoral researcher. Wikidata benefits from a vast portfolio of funding sources, as described in our answer to reviewer 1, making it less vulnerable. As long as it remains useful to the community, there is no doubt it will be funded.

re: "The openness of Wikidata also offers unprecedented opportunities for community curation, which will support, if not guarantee, a dynamic and evolving data repository."

Can you provide example in which community curation happened? Did any non-LOTUS contributor curate data provided by LOTUS initiative?

Here is an example of cholesterol entry dynamic evolution: https://www.wikidata.org/w/index.php?title=Q43656&action=history.

The KrBot removed some duplicated entries and redirected some other ones. Clerodendrum fragrans (Q15622830), for example, became Clerodendrum chinense (Q10475331) and our statement was updated by this non-LOTUS contributor.

"As the taxonomy of living organisms is a complex and constantly evolving field, all the taxon identifiers from all accepted taxonomic DB for a given taxon name were kept. Initiatives such as the Open Tree of Life (OTL) (Rees and Cranston, 2017) will help to gradually reduce these discrepancies, and the Wikidata platform should support such developments."

Discrepancies, conflicts, taxonomic revisions, and disagreements are common between taxonomies due to scientific differences, misaligned update schedules, and transcription errors. However, Wikidata does not support organized dissent, instead forcing a single (artificial) taxonomic view onto commonly used names. In my mind, this oversimplifies taxonomic realities, favors one-size-fits-none taxonomies, and leaves behind the specialized taxonomic authorities that are less tech savvy, but highly accurate in their systematics. How do you imagine dealing with conflicting, or outdated, taxonomic interpretations related to published names?

The reviewer is right about the complexity of taxonomy. However, Wikidata does not favor a "one size fits all" approach and in fact supports organized dissent. Please see https://www.wikidata.org/wiki/Wikidata:WikiProject_Taxonomy for additional background on the matter.

For example, https://www.wikidata.org/wiki/Q161265 supports two parent taxa, leading to a divergent graph https://w.wiki/453a.

One of the weak points might be in the distinction between a taxon and a taxon name, but this problem is far from being addressed in current natural products literature.

Within our pipeline, outdated taxon names are linked to their corresponding accepted ones, and we therefore attach structures reported under old taxon names to their currently accepted ones. If these in turn become outdated, and a new accepted name is linked to them within Wikidata, the data will continue to follow.
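
A toy sketch of this synonym-to-accepted-name resolution (the synonym table and function are illustrative; the actual pipeline resolves names against accepted taxonomic databases):

```python
# Illustrative sketch: structures reported under an outdated taxon name are
# attached to the currently accepted name, following synonym links.
# The mapping below is a toy example, using the case cited in the text.

accepted_name = {
    "Clerodendrum fragrans": "Clerodendrum chinense",  # example from the text
}

def resolve(taxon):
    """Follow synonym links until a currently accepted name is reached."""
    seen = set()  # guard against accidental cycles in the mapping
    while taxon in accepted_name and taxon not in seen:
        seen.add(taxon)
        taxon = accepted_name[taxon]
    return taxon

pairs = [("Clerodendrum fragrans", "structure-X")]  # (organism, structure)
resolved = [(resolve(org), structure) for org, structure in pairs]
```

If the accepted name itself is later superseded, extending the mapping is enough for the same pairs to follow the new name on the next run.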

"The currently favored approach to add new data to LOTUS is to edit Wikidata entries directly. Newly edited data will then be imported into the SSOT repository. There are several ways to interact with Wikidata which depend on the technical skills of the user and the volume of data to be imported/modified."

How do you deal with edit conflicts? E.g., source data is updated around the same time as an expert updates the compound-organism-reference triple through Wikidata. Which edit remains?

The rule is: the robot keeps a memory of what it sent and will never try to send it again unless manually overridden. As such, the robot trusts humans and other robots. If someone edits a statement (deletion, modification), the robot will not try to re-send it. In case of vandalism, three approaches can be used:

– manual undo (for small cases)

– Wikidata admins undo (for large abuses)

– robot reupload (in case Wikidata admins cannot deal with the vandalism)

So, in case of manual addition by an expert on Wikidata, the expert addition remains. In case of manual deletion by an expert, the expert deletion remains.
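
This "send once, trust humans" rule can be sketched as follows (names and data model are hypothetical; the real bot is the lotus-wikidata-interact code):

```python
# Hypothetical sketch of the upload rule described above: the bot keeps a
# persistent log of statements it already sent and never re-sends them, so
# expert edits (additions and deletions) on Wikidata are preserved.

sent_log = set()   # persistent memory of statements already uploaded
wikidata = set()   # toy model of the current Wikidata state

def upload(statement, force=False):
    """Send a statement only if it was never sent, unless overridden."""
    if statement in sent_log and not force:
        return False  # trust humans and other bots: do not re-add
    sent_log.add(statement)
    wikidata.add(statement)
    return True

s = ("compound-Q1", "found in taxon", "taxon-Q2")
upload(s)                          # first run: the statement is uploaded
wikidata.discard(s)                # an expert manually deletes the statement
readded = upload(s)                # next bot run: the deletion is respected
restored = upload(s, force=True)   # manual override, e.g. after vandalism
```

The `force` flag corresponds to the manual-override case (robot reupload after vandalism); in normal operation the bot never sets it.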

We clarified this in the manuscript also (see https://github.com/lotusnprod/lotus-manuscript/commit/6f39237d23980cf3af7c78e4ff7c2a2b4cc64629).

"Even if correct at a given time point, scientific advances can invalidate or update previously uploaded data. Thus, the possibility to continuously edit the data is desirable and guarantees data quality and sustainability. Community-maintained knowledge bases such as Wikidata encourage such a process."

On using LOTUS in scientific publications, how can you retrieve the exact version of Wikidata or LOTUS resource referenced in a specific publication? In my understanding, specific Wikidata versions are not easy to access and discarded periodically.

Regarding the versions, in our view, it is the responsibility of the authors of a scientific publication to freeze the version they used and provide it, with possible modifications, to the reviewers and to the public. We provide the query to export all LOTUS pairs from Wikidata, so authors just have to run the query, timestamp it, use it as they wish, and provide the version they used. This allows full reproducibility while keeping the "living" aspect of Wikidata. See also our comments to the editors regarding possible versioning mechanisms, e.g., "We will also regularly upload results of a global SPARQL query to the Zenodo repository using https://github.com/lotusnprod/lotus-wikidata-interact or a future version of this bot, as for this dataset: https://zenodo.org/record/5668855" or "For example, the output of this Wikidata SPARQL query https://w.wiki/4N8G run on 2021-11-10T16:56 can easily be archived and shared in a publication: https://zenodo.org/record/5668380."
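
The freeze-and-timestamp workflow could look like the following sketch (the naming scheme and function are ours for illustration; the actual query and download step are omitted):

```python
# Sketch of freezing a LOTUS export for a publication: after running the
# SPARQL export query, store the result under a timestamped filename so that
# readers and reviewers can reference the exact version used.
# The filename scheme is hypothetical, not part of LOTUS.

from datetime import datetime, timezone

def archive_name(prefix, when):
    """Build a filesystem-safe, timestamped archive filename."""
    return f"{prefix}_{when.strftime('%Y-%m-%dT%H-%M-%SZ')}.csv"

# e.g., freezing a query run on 2021-11-10 at 16:56 UTC
frozen = archive_name("lotus_wikidata_export",
                      datetime(2021, 11, 10, 16, 56, tzinfo=timezone.utc))
```

Depositing the resulting file on an archive such as Zenodo then gives it a citable, versioned DOI, as in the examples above.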

On the other hand, on lotus.naturalproducts.net, we implemented versioning for users, so they can simply refer to the version they downloaded from the website.

We agree with the reviewer that Wikidata versions are difficult to query for specific natural products information. The Wikidata Toolkit (https://www.mediawiki.org/wiki/Wikidata_Toolkit) might offer options to query them. However, nothing is discarded.

Some versioning mechanism propositions were added to the manuscript: see https://github.com/lotusnprod/lotus-manuscript/commit/92833166375aa9cc29f44cffdd5ad693fa42934c.

Please provide an example of how you imagine LOTUS users citing claims provided by a specific version of LOTUS SSOT to Wikidata.

LOTUS does not directly claim occurrences of chemical structures within biological organisms. The claims are directly supported by the bibliographic reference.

"The LOTUS initiative provides a framework for rigorous review and incorporation of new records and already presents a valuable overview of the distribution of NP occurrences studied to date."

How can a non-LOTUS contributor dissent with a claim made by LOTUS, and make sure that their dissent is recorded in Wikidata or other available platforms? Some claims are expected to be disputed and resolved over time. Please provide an example of how you imagine dealing with unresolved disputes.

The easiest way is to contribute directly to Wikidata. Some tutorials are given in the SI for creating new statements, which also show how to modify one. Wikidata, thanks to its continuously updated content, will always be "one step ahead" of LOTUS. We will periodically re-import Wikidata data and will be able to see:

– new statements corresponding to our criteria

– statements we made that were modified

In the case of modification of statements we previously uploaded, Wikidata always has priority. Unresolved disputes can only be dealt with manually.

We take this opportunity to link an entry we created that raised words of caution, discussed openly with details on the reasons why: https://www.wikidata.org/wiki/Talk:Q104916955

"Community participation is the most efficient means of achieving more comprehensive documentation of NP occurrences, and the comprehensive editing opportunities provided within LOTUS and through the associated Wikidata distribution platform open new opportunities for collaborative engagement."

Do you have any evidence to suggest that community participation is the most effective way to achieve more comprehensive documentation of NP occurrences? Please provide references or data to support this claim.

Our phrasing was not accurate and we modified this section. We thank the reviewer again for careful reading.

"In addition to facilitating the introduction of new data, it also provides a forum for critical review of existing data, as well as harmonization and verification of existing NP datasets as they come online."

Please provide example of cases in which a review led to a documented dispute that allowed two or more conflicting sources to co-exist. Show example of how to annotate disputed claims and claims with suggested corrections.

As LOTUS data is still "young", it is difficult to find such a discussion, although we already found an example: https://www.wikidata.org/wiki/Talk:Q104916955. More advanced examples exist, such as for some Aloe sp. (https://www.wikidata.org/wiki/Talk:Q145534). One of the most developed discussions concerning taxa is about Boletus erythropus: under https://www.wikidata.org/wiki/Q728666, users can read about how useful those discussions can be. (A query listing all organisms linked to compounds that have an open discussion page is available at https://petscan.wmflabs.org/?psid=20892472.)

How do imagine using Wikidata to document evidence for a refuted "compound-found-in-organism claimed-by-reference" claim?

While it is still difficult to imagine the whole community discussing claims on Wikidata, it offers the right infrastructure to do so. As an example, the place of burial of Albert Einstein has been discussed at https://www.wikidata.org/wiki/Talk:Q937. This would also be possible for natural products occurrences, with experts discussing the statements made on the compound page. Just like item pages, the talk pages are versioned. See also the previously mentioned examples.

Here also, implementing the Evidence Ontology (http://obofoundry.org/ontology/eco.html) might be a good direction to further characterize and discuss the pertinence of LOTUS data. Such statements could, for example, complement the "stated in" reference for a natural product occurrence and more formally document the type of evidence (e.g., isolation, NMR, crystal structure, etc.).
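
An evidence-qualified reference could be modeled as in the sketch below (this is a perspective, not an implemented feature; all identifiers except the "found in taxon" property P703 and the ECO term are placeholders):

```python
# Hypothetical sketch of a "found in taxon" statement whose reference block
# carries both the bibliographic source ("stated in") and an Evidence
# Ontology term describing the type of evidence supporting the occurrence.

statement = {
    "subject": "chemical compound (placeholder Q-id)",
    "property": "found in taxon (P703)",
    "object": "taxon (placeholder Q-id)",
    "references": [{
        "stated in": "original publication (DOI)",
        # hypothetical qualifier; ECO:0000006 is ECO's "experimental evidence"
        "evidence": "ECO:0000006",
    }],
}
```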

"Researchers worldwide uniformly acknowledge the limitations caused by the intrinsic unavailability of essential (raw) data (Bisson et al., 2016)."

Note that FAIR does not mandate open access of data, only suggests vague guidelines to find, access, integrate and re-use possibly resources that may or may not be access controlled.

If you'd like to emphasize open access in addition to FAIR, please separately mention Open Data / Open Access references.

We thank the reviewer for pointing this out, we added Open Data in addition to FAIR.

re: "We believe that the LOTUS initiative has the potential to fuel a virtuous cycle of research habits and, as a result, contribute to a better understanding of Life and its chemistry."

LOTUS relies on the (hard) work of underlying data sources to capture the relationships of chemical compounds with organisms along with their citation reference. However, the data sources are not credited in the Wikidata interface. How does *not* crediting your data sources contribute to a virtuous cycle of research habits?

We understand the reviewer's comment, which relates to our wish to break with previous cycles of research habits. We argue that data sourcing must shift from the old DB model back to the original publication, which is the only source allowing one to (in)validate a statement. We do not want to fall into a never-ending circle where a pair originally described in article X is then extracted from KNApSAcK, then from NPAtlas, then from COCONUT. The credit goes to the authors who did the hard work of experimentally describing the pair. Just as when citing a reference in a scientific paper, the idea is to cite the original work, not intermediate articles. Hopefully, in the end, authors will make their data directly available on Wikidata, perhaps with the help of publishers.

re: Wikidata activity of NPImporterBot

Associated bot can be found at:

https://www.wikidata.org/wiki/User:NPImporterBot

On 5 Aug 2021, the page https://www.wikidata.org/wiki/Special:Contributions/NPImporterBot

most recent entries included a single edit from 4 June 2021

https://www.wikidata.org/w/index.php?title=Q107038883&oldid=1434969377

followed by many edits from 1 March 2021

e.g.,

https://www.wikidata.org/w/index.php?title=Q27139831&oldid=1373345585

with many entries added/upserted in early Dec 2020

with earliest entries from 28 Aug 2020

https://www.wikidata.org/w/index.php?title=User:NPImporterBot&oldid=1267066716

It appears that the bot has been inactive since June 2021. Is this expected? When does the bot become active?

This behaviour is expected. The bot becomes active periodically, when we subjectively find there is enough change to justify running it. We will probably run it on a regular basis, or whenever major open databases (or updates thereof) are released. We waited for the review before running the bot again, so we could easily highlight differences. This has now been done, and the difference can be observed between 10.5281/zenodo.5665295 and 10.5281/zenodo.5793668.

re: "The Wikidata importer is available at https://gitlab.com/lotus7/lotus-wikidata-importer. This program takes the processed data resulting from the lotusProcessor subprocess as input and uploads it to Wikidata. It performs a SPARQL query to check which objects already exist. If needed, it creates the missing objects. It then updates the content of each object. Finally, it updates the chemical compound page with a "found in taxon" statement complemented with a "stated in" reference."

How does the Wikidata upserter take manual edits into account? Also, if an entry is manually deleted, how does the upserter know the statement was explicitly removed, to make sure not to re-add a potentially erroneous entry?

This was answered in our response to an earlier question above.

In attempting to review and reproduce (parts of) workflow using provided documentation, https://gitlab.com/lotus7/lotus-processor/-/blob/7484cf6c1505542e493d3c27c33e9beebacfd63a/README.adoc#user-content-pull-the-repository was found to suggest to run:

git pull https://gitlab.unige.ch/Adriano.Rutz/opennaturalproductsdb.git

However, this command failed with error:

fatal: not a git repository (or any parent up to mount point /home)

Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).

Assuming that the documentation meant to suggest to run:

git clone https://gitlab.com/lotus7/lotus-processor,

the subsequent command to retrieve all data also failed:

$ dvc pull

WARNING: No file hash info found for 'data/processed'. It won't be created.

WARNING: No file hash info found for 'data/interim'. It won't be created.

WARNING: No file hash info found for 'data/external'. It won't be created.

WARNING: No file hash info found for 'data/validation'. It won't be created.

4 files failed

ERROR: failed to pull data from the cloud – Checkout failed for following targets:

data/processed

data/interim

data/external

data/validation

Is your cache up to date?

Without a reliable way to retrieve the data products, I cannot review your work and verify your claims of reproducibility and versioned data products.

Also, in https://gitlab.com/lotus7/lotus-processor/-/blob/7484cf6c1505542e493d3c27c33e9beebacfd63a/README.adoc#user-content-dataset-list, you claim that the dataset list was captured in xref:docs/dataset.tsv. However, no such file was found. Suspect a broken link with correct version pointing to xref:docs/dataset.csv.

We thank the reviewer for the careful reading; we corrected the broken links. The DVC access was actually an internal one, with no means for non-members to access the data. We removed DVC for the public and made the whole pipeline available with all accessible data. We wrote additional programs to gather all data in a programmatic way, accessible at https://github.com/lotusnprod/lotus-processor/tree/main/src/1_gathering (related commits: https://github.com/lotusnprod/lotus-processor/commit/6cdf56b65b9296eb6fa4f466857dd753c145898c, https://github.com/lotusnprod/lotus-processor/commit/335e83434d20fdacd25505dd0973980756940570, and https://github.com/lotusnprod/lotus-processor/commit/9e93224ca138213bbe24b25f01648fc9be3cd739).

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

The manuscript has been improved but there are some remaining issues (particularly those of Reviewer 2) that need to be addressed, as outlined below:

Reviewer #2 (Recommendations for the authors):

Rutz et al. describe the LOTUS initiative, a database which contains over 750,000 referenced structure-organism pairs. They present both the data, which they have made available in Wikidata, and their interactive web portal (https://lotus.naturalproducts.net/), which provides a powerful platform for mining the literature for published data on structure-organism pairs.

The clarity and completeness of the manuscript has improved significantly from the first round to the second round of reviews. Specifically, the authors clarified the scope of the study and provided concrete, and clear examples of how they think the tool should/could be used to advance np research.

I think the authors adequately addressed reviewer concerns regarding the documentation of the work and provide ample documentation on how the LOTUS initiative was built, as well as useful documentation on how it can be used by researchers.

Strengths:

The LOTUS initiative is a completely open-source database built on the principles of FAIRness and TRUSTworthiness. Moreover, the authors have laid out their vision of how they hope LOTUS will evolve and grow.

The authors make a significant effort to consider the LOTUS end-user. This is evidenced by the clarity of their manuscript and the completeness of their documentation and tutorials for how to use, cite and add to the LOTUS database.

Weaknesses/Questions:

1) The authors largely addressed my primary concern about the previous version of the manuscript, which was that the scope and the completeness of the LOTUS initiative were not clearly defined. Moreover, providing concrete examples of how the resource can be used both clarifies the scope of the project and increases the chances that it will be adopted by the greater scientific community. In the previous round of reviews, I asked "if LOTUS represents a comprehensive database". In their response, the authors raise the important point that the database will always be limited by the available data in the literature. While I agree with the authors on this point and do not think it takes away from the value of the manuscript/initiative, I think a more nuanced discussion of this point is merited in the paper.

The authors say that the example queries provided illustrate that LOTUS can be used to answer research questions, e.g., how many compounds are found in Arabidopsis. However, interpreting the results of such queries is far more nuanced. For example, can one quantitatively compare the number of compounds observed in different species or families based on the database (e.g., do legumes produce more nitrogen-containing compounds than other plant families)? Or do the inherent biases in the literature inhibit our ability to draw conclusions such as this? I like that the authors showcase the potential value of using the outlined queries as a starting place for a systematic review, as they highlight in the text. However, I would like to see the authors discuss the inherent limitations of using/interpreting the database queries given the incompleteness of NP exploration.

We thank the reviewer for their appreciation of our response to their first comments regarding the coverage of the database.

We also thank them for pointing again to the current limitations in the completeness of the LOTUS initiative resource. We added an in-depth discussion of the possible causes of the incompleteness of these (and other existing) NP resources, and also propose some perspectives to address them in the future. See the Data Interpretation paragraph:

https://lotusnprod.github.io/lotus-manuscript/#data-interpretation

2) On page (16?) the authors discuss how the exploration of NP is rather limited, citing the fact that each structure is found on average in only 3 organisms, and that only eleven structures are reported per organism on average. How do we know that these low numbers are a result of incomplete literature or research efforts, and not of the biology of natural products? For example, plant secondary metabolites are extremely diverse, and untargeted metabolomics studies have shown that any given compound is found in only a few species. There is likely no way of knowing the answer to this question, but this section could benefit from a more in-depth discussion.

We thank the reviewer for opening the door for this discussion which is in fact closely related to the previous point.

While we agree that a wide range of the overall metabolism is heavily specialized, and thus that some compounds might indeed be found in only a few species, the untargeted metabolomics studies mentioned by the reviewer also highlight how much remains to be discovered (much more than eleven compounds per organism). This is already confirmed in heavily studied organisms, where the number of compounds reported is sometimes orders of magnitude higher.

We thus do not necessarily expect the number of organisms per structure to grow much (except for basal metabolism, not much will indeed be shared), but rather the number of structures per organism. Again, we would like to highlight that the habit of reporting only a few new compounds is strongly influenced by the editorial guidelines of journals publishing natural products research and is one of the causes of this bias.

One of our next research projects actually stems from these observations. We want to explore in more depth the distribution of core and specialized metabolism across the tree of life, using LOTUS data as a starting point. By modeling the distribution of chemical classes across the taxonomy and taking into account priors such as the research effort on given taxa or biases inherent to certain chemical classes (alkaloids, for example, are easy to obtain through acid/base extraction and are sought after for their often potent bioactivities), we expect to disentangle the parts of the tree of life that lack the research efforts needed to be better described from those that actually do not produce the chemicals.

Here again, see our additional paragraph in the Data Interpretation section to answer the reviewer's justified comment:

https://lotusnprod.github.io/lotus-manuscript/#data-interpretation

3) I do not understand Figure 4. I would like to see more details in the figure legend and an explanation of what an "individual" represents. Are these individual publications? Individual structure-organism pairs? Also, what do the box plots represent? Are they connected to the specific query of Arabidopsis thaliana/β-sitosterol? If they are, there should only be a single value returned for the number of structures in Arabidopsis thaliana, rather than a distribution. Thus, I assume that the accumulation curve and the box plots are unrelated, and this should be spelt out in the figure legend or separated into two different panels.

We improved the figure caption taking the reviewer’s comments into account.

The “number of individuals” (now renamed “number of entries”) represents either a chemical structure or a biological organism (a species), according to the color. Both box plots and curves take all organisms and structures present in LOTUS into account. A. thaliana and KZJWDPNRJALLNS are taken as two notable examples to guide the reader. A. thaliana contains 687 different short inchikeys and KZJWDPNRJALLNS is reported in 3981 distinct organisms. These two examples are thus just a specific case of both the box plot and the curves. They were shifted up in the new figure to avoid confusion.

The accumulation curves and box plots are indeed related: they are two visualizations of the same data, each with its own limitations. The accumulation curves show, for example, that around 80 percent of the structures present in LOTUS are covered by less than 10 percent of the organisms; this cannot be seen on the box plot. Conversely, the box plot shows that the median number of reported structures per organism is 5, which cannot be read from the accumulation curves. We thank the reviewer again for pointing out the lack of clarity in our caption, which should now allow the reader an easier understanding of Figure 4.
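The two readings described above (an accumulation curve and a per-organism median) can both be derived from raw structure-organism pairs. The sketch below uses invented toy data, not actual LOTUS figures, to show the computation:

```python
# Minimal sketch, with toy data: computing the organism-side accumulation
# curve and the per-organism median from structure-organism pairs.
from statistics import median

# (short InChIKey, organism) pairs -- invented toy data, not LOTUS content
pairs = [
    ("AAA", "sp1"), ("BBB", "sp1"), ("CCC", "sp1"),
    ("AAA", "sp2"), ("DDD", "sp2"),
    ("AAA", "sp3"),
    ("EEE", "sp4"),
]

# Group distinct structures by organism
by_organism = {}
for structure, organism in pairs:
    by_organism.setdefault(organism, set()).add(structure)

# Median number of distinct structures reported per organism
med = median(len(s) for s in by_organism.values())

# Accumulation: walk organisms from richest to poorest and track the
# cumulative fraction of all distinct structures covered so far.
all_structures = {s for s, _ in pairs}
covered, curve = set(), []
for structs in sorted(by_organism.values(), key=len, reverse=True):
    covered |= structs
    curve.append(len(covered) / len(all_structures))

print(med)    # 1.5
print(curve)  # [0.6, 0.8, 0.8, 1.0]
```

On these toy pairs, the curve plateaus early because the most structure-rich organisms already cover most distinct structures, which is exactly the pattern the LOTUS accumulation curves make visible at scale.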

4) On page 5 the authors introduce/discuss COCONUT. I am confused about how exactly COCONUT is related to or different from LOTUS. Can you provide some more context, similar to how the other databases were discussed in the preceding paragraph?

We thank the reviewer for this interesting question and are happy to provide some clarification both here and in the manuscript. In LOTUS, every NP structure is strictly associated with at least one organism known to produce it, and this association is always documented, which makes LOTUS a highly curated resource. In contrast, COCONUT contains all known natural product structures, regardless of their documentation status or producer information, and with variable annotation quality. These annotation aspects make the two databases complementary yet distinct, much as SwissProt and TrEMBL are for protein data. Figure 5 offers further information on the differences and complementarity between COCONUT and LOTUS in terms of coverage of biological occurrences.

Additionally, to avoid needless maintenance costs, both websites (https://coconut.naturalproducts.net/, https://lotus.naturalproducts.net/) are hosted on the same domain and built with the same framework, which explains their visual similarity.

We rephrased and clarified the paragraph defining the role of COCONUT in the LOTUS Initiative in the Introduction section.
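As a concrete illustration of how LOTUS data hosted on Wikidata can be interrogated, the sketch below assembles (but does not execute) a SPARQL query counting distinct compounds reported in a given taxon. The Wikidata properties P703 ("found in taxon") and P235 ("InChIKey") and the item ID Q158695 (assumed here to be Arabidopsis thaliana) are our illustrative choices, not taken from the manuscript; the endpoint URL is the public Wikidata Query Service.

```python
# Sketch: building (not executing) a SPARQL query against the public
# Wikidata Query Service. P703 = "found in taxon", P235 = "InChIKey";
# Q158695 is given as an assumed example QID for Arabidopsis thaliana.
ENDPOINT = "https://query.wikidata.org/sparql"

def compounds_in_taxon_query(taxon_qid):
    """Return a SPARQL query string counting compounds found in a taxon."""
    return f"""
    SELECT (COUNT(DISTINCT ?compound) AS ?n) WHERE {{
      ?compound wdt:P235 ?inchikey ;      # compound has an InChIKey
                wdt:P703 wd:{taxon_qid} . # and is found in the taxon
    }}
    """

query = compounds_in_taxon_query("Q158695")
# To run it, send `query` to ENDPOINT over HTTP, e.g.
# requests.get(ENDPOINT, params={"query": query, "format": "json"})
print(query)
```

Because LOTUS statements live in the general Wikidata knowledge graph, the same pattern can be joined with any other Wikidata property, which is precisely the interoperability argument made in the manuscript.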

https://doi.org/10.7554/eLife.70780.sa2

Article and author information

Author details

  1. Adriano Rutz

    1. School of Pharmaceutical Sciences, University of Geneva, Geneva, Switzerland
    2. Institute of Pharmaceutical Sciences of Western Switzerland, University of Geneva, Geneva, Switzerland
    Contribution
    Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Validation, Visualization, Wikidata, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
ORCID: 0000-0003-0443-9902
  2. Maria Sorokina

    Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena, Jena, Germany
    Contribution
    LNPN Website, Software, Writing – review and editing
    Competing interests
    No competing interests declared
ORCID: 0000-0001-9359-7149
  3. Jakub Galgonek

    Institute of Organic Chemistry and Biochemistry of the CAS, Prague, Czech Republic
    Contribution
    Sachem, IDSM, Software, Writing – review and editing
    Competing interests
    No competing interests declared
ORCID: 0000-0002-7038-544X
  4. Daniel Mietchen

    1. Ronin Institute, Montclair, United States
    2. Leibniz Institute of Freshwater Ecology and Inland Fisheries, Berlin, Germany
    3. School of Data Science, University of Virginia, Charlottesville, United States
    Contribution
    Wikidata, Writing – review and editing
    Competing interests
    No competing interests declared
ORCID: 0000-0001-9488-1870
  5. Egon Willighagen

    Department of Bioinformatics-BiGCaT, Maastricht University, Maastricht, Netherlands
    Contribution
    Wikidata, Writing – review and editing
    Competing interests
    No competing interests declared
ORCID: 0000-0001-7542-0286
  6. Arnaud Gaudry

    1. School of Pharmaceutical Sciences, University of Geneva, Geneva, Switzerland
    2. Institute of Pharmaceutical Sciences of Western Switzerland, University of Geneva, Geneva, Switzerland
    Contribution
    Visualization, Writing – review and editing
    Competing interests
    No competing interests declared
ORCID: 0000-0002-3648-7362
  7. James G Graham

    1. Center for Natural Product Technologies and WHO Collaborating Centre for Traditional Medicine (WHO CC/TRM), Pharmacognosy Institute; College of Pharmacy, University of Illinois at Chicago, Chicago, United States
    2. Department of Pharmaceutical Sciences, College of Pharmacy, University of Illinois at Chicago, Chicago, United States
    Contribution
    NAPRALERT, Writing – review and editing
    Competing interests
    No competing interests declared
  8. Ralf Stephan

Ontario Institute for Cancer Research (OICR), Toronto, Canada
    Contribution
    Wikidata, Writing – review and editing
    Competing interests
    No competing interests declared
ORCID: 0000-0002-4650-631X
  9. Roderic Page

    University of Glasgow, Glasgow, United Kingdom
    Contribution
    Wikidata
    Competing interests
    No competing interests declared
ORCID: 0000-0002-7101-9767
  10. Jiří Vondrášek

    Institute of Organic Chemistry and Biochemistry of the CAS, Prague, Czech Republic
    Contribution
    Funding acquisition, Resources, Sachem, IDSM
    Competing interests
    No competing interests declared
ORCID: 0000-0002-6066-973X
  11. Christoph Steinbeck

    Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena, Jena, Germany
    Contribution
    Funding acquisition, LNPN Website, Resources, Writing – review and editing
    Competing interests
    No competing interests declared
ORCID: 0000-0001-6966-0814
  12. Guido F Pauli

    1. Center for Natural Product Technologies and WHO Collaborating Centre for Traditional Medicine (WHO CC/TRM), Pharmacognosy Institute; College of Pharmacy, University of Illinois at Chicago, Chicago, United States
    2. Department of Pharmaceutical Sciences, College of Pharmacy, University of Illinois at Chicago, Chicago, United States
    Contribution
    Funding acquisition, NAPRALERT, Resources, Writing – review and editing
    Competing interests
    No competing interests declared
ORCID: 0000-0003-1022-4326
  13. Jean-Luc Wolfender

    1. School of Pharmaceutical Sciences, University of Geneva, Geneva, Switzerland
    2. Institute of Pharmaceutical Sciences of Western Switzerland, University of Geneva, Geneva, Switzerland
    Contribution
    Funding acquisition, Resources, Writing – review and editing
    Competing interests
    No competing interests declared
ORCID: 0000-0002-0125-952X
  14. Jonathan Bisson

    1. Center for Natural Product Technologies and WHO Collaborating Centre for Traditional Medicine (WHO CC/TRM), Pharmacognosy Institute; College of Pharmacy, University of Illinois at Chicago, Chicago, United States
    2. Department of Pharmaceutical Sciences, College of Pharmacy, University of Illinois at Chicago, Chicago, United States
    Contribution
    Conceptualization, Data curation, Formal analysis, Investigation, Methodology, NAPRALERT, Project administration, Resources, Software, Supervision, Validation, Writing – review and editing
    For correspondence
    bjo@uic.edu
    Competing interests
    No competing interests declared
ORCID: 0000-0003-1640-9989
  15. Pierre-Marie Allard

    1. School of Pharmaceutical Sciences, University of Geneva, Geneva, Switzerland
    2. Institute of Pharmaceutical Sciences of Western Switzerland, University of Geneva, Geneva, Switzerland
    3. Department of Biology, University of Fribourg, Fribourg, Switzerland
    Contribution
    Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Wikidata, Writing – original draft, Writing – review and editing
    For correspondence
    pierre-marie.allard@unifr.ch
    Competing interests
    No competing interests declared
ORCID: 0000-0003-3389-2191

Funding

Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung (CRSII5_189921)

  • Adriano Rutz
  • Jean-Luc Wolfender
  • Pierre-Marie Allard

Office of Dietary Supplements (P50 AT000155)

  • James G Graham
  • Guido F Pauli
  • Jonathan Bisson

Deutsche Forschungsgemeinschaft (239748522)

  • Maria Sorokina
  • Christoph Steinbeck

Alfred P. Sloan Foundation (G-2019-11458)

  • Daniel Mietchen
  • Egon Willighagen

National Center for Complementary and Integrative Health (U41 AT008706)

  • James G Graham
  • Jonathan Bisson
  • Guido F Pauli

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

AR, JLW, and PMA thank the Swiss National Science Foundation for supporting part of this project through the SNF Sinergia grant CRSII5_189921. JB and AR thank JetBrains for the free educational license of IntelliJ and the excellent support received on YouTrack. JB, JGG, and GFP gratefully acknowledge the support of this work by grant U41 AT008706 and supplemental funding to P50 AT000155 from NCCIH and ODS of the NIH. MS and CS are supported by the German Research Foundation within the framework ChemBioSys (Project-ID 239748522, SFB 1127). EW and DM acknowledge the Scholia grant from the Alfred P Sloan Foundation under grant number G-2019-11458. The work on the Wikidata IDSM/Sachem endpoint was supported by an ELIXIR CZ research infrastructure project grant (MEYS Grant No: LM2018131), including access to computing and storage facilities. The authors thank Dmitry Mozzherin for his work on the Global Names Architecture and related improvements; the PubChem team, especially Tiejun Cheng and Evan Bolton, for help integrating LOTUS; Layla Michán for starting to add pigment information to Wikidata; the team behind Manubot (Himmelstein et al., 2019), which was used to write this manuscript; and the contributors of all electronic NP resources used in this work and the NP community at large.

Senior Editor

  1. Anna Akhmanova, Utrecht University, Netherlands

Reviewing Editor

  1. David A Donoso, Escuela Politécnica Nacional, Ecuador

Reviewer

  1. Charles Tapley Hoyt

Publication history

  1. Preprint posted: March 1, 2021 (view preprint)
  2. Received: May 28, 2021
  3. Accepted: March 22, 2022
  4. Version of Record published: May 26, 2022 (version 1)

Copyright

© 2022, Rutz et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.



Cite this article

Adriano Rutz, Maria Sorokina, Jakub Galgonek, Daniel Mietchen, Egon Willighagen, Arnaud Gaudry, James G Graham, Ralf Stephan, Roderic Page, Jiří Vondrášek, Christoph Steinbeck, Guido F Pauli, Jean-Luc Wolfender, Jonathan Bisson, Pierre-Marie Allard (2022) The LOTUS initiative for open knowledge management in natural products research. eLife 11:e70780. https://doi.org/10.7554/eLife.70780
