A framework for community curation of interspecies interactions literature
Abstract
The quantity and complexity of data being generated and published in biology has increased substantially, but few methods exist for capturing knowledge about phenotypes derived from molecular interactions between diverse groups of species, in such a way that is amenable to data-driven biology and research. To improve access to this knowledge, we have constructed a framework for the curation of the scientific literature studying interspecies interactions, using data curated for the Pathogen–Host Interactions database (PHI-base) as a case study. The framework provides a curation tool, phenotype ontology, and controlled vocabularies to curate pathogen–host interaction data, at the level of the host, pathogen, strain, gene, and genotype. The concept of a multispecies genotype, the ‘metagenotype,’ is introduced to facilitate capturing changes in the disease-causing abilities of pathogens, and host resistance or susceptibility, observed by gene alterations. We report on this framework and describe PHI-Canto, a community curation tool for use by publication authors.
Editor's evaluation
Focused on host-pathogen interactions, this valuable study presents a useful resource for unifying language(s) and rules used in biology experiments, with a new ontology and tool called PHI-Canto. The framework enables using UniProtKB IDs to curate proteins and eventually derive 'metagenotypes', an important concept that may incidentally help shrinking proliferating names and acronyms for genes, processes, and interactions. This important framework builds on established standards and methods and was rigorously tested with a variety of publications, providing a system that may eventually capture complex information hidden in the data, such as metagenotypes.
https://doi.org/10.7554/eLife.84658.sa0eLife digest
The increasingly vast amount of data being produced in research communities can be difficult to manage, making it challenging for both humans and computers to organise and connect information from different sources. Currently, software tools that allow authors to curate peer-reviewed life science publications are designed solely for single species, or closely related species that do not interact.
Although most research communities are striving to make their data FAIR (Findable, Accessible, Interoperable and Reusable), it is particularly difficult to curate detailed information based on interactions between two or more species (interspecies), such as pathogen-host interactions. As a result, there was a lack of tools to support multi-species interaction databases, leading to a reliance on labour-intensive curation methods.
To address this problem, Cuzick et al. used the Pathogen-Host Interactions database (PHI-base), which curates knowledge from the text, tables and figures published in over 200 journals, as a case study. A framework was developed that could capture the many observable traits (phenotype annotations) for interactions and link them directly to the combination of genotypes involved in those interactions across multiple scales – ranging from microscopic to macroscopic. This demonstrated that it was possible to build a framework of software tools to enable curation of interactions between species in more detail than had been done before.
Cuzick et al. developed an online tool called PHI-Canto that allows any researcher to curate published pathogen-host interactions between almost any known species. An ontology – a collection of concepts and their relations – was created to describe the outcomes of pathogen-host interactions in a standardised way. Additionally, a new concept called the ‘metagenotype’ was developed which represents the combination of a pathogen and a host genotype and can be easily annotated with the phenotypes arising from each interaction.
The newly curated multi-species FAIR data on pathogen-host interactions will enable researchers in different disciplines to compare and contrast interactions across species and scales. Ultimately, this will assist the development of new approaches to reduce the impact of pathogens on humans, livestock, crops and ecosystems with the aim of decreasing disease while increasing food security and biodiversity. The framework is potentially adoptable by any research community investigating interactions between species and could be adapted to explore other harmful and beneficial interspecies interactions.
Introduction
Recent technological advancements across the biological sciences have resulted in an increasing volume of peer-reviewed publications reporting experimental data and conclusions. To increase the value of this highly fragmented knowledge, biocurators manually extract the data from publications and represent it in a standardized and interconnected way following the FAIR (Findable, Accessible, Interoperable, and Reusable) Data Principles (International Society for Biocuration, 2018; Wilkinson et al., 2016). Curated functional data is then made available in online databases, either organism- or clade-specific (e.g. model organism databases) or those supporting multiple kingdoms of life (e.g. PHI-base Urban et al., 2022), Alliance of Genomes Resources (Agapite et al., 2020), or UniProt (Bateman et al., 2021). Due to the complexity of the biology and the specificity of the curation requirements, manual biocuration is currently the most reliable way to capture information about function and phenotype in databases and knowledge bases (Wood et al., 2022). For pathogen–host interactions, the original publications do not provide details of specific strains, variants, and their associated genotypes and phenotypes, nor the relative impact on pathogenicity and virulence, in a standardized machine-readable format. The expert curator synergizes knowledge from different representations (text, graphs, images) into clearly defined machine-readable syntax. The development of curation tools with clear workflows supporting the use of biological ontologies and controlled vocabularies has standardized curation efforts, reduced ambiguity in annotation, and improved the maintenance of the curated corpus as biological knowledge evolves (International Society for Biocuration, 2018).
The pathogen–host interaction research communities are an example of a domain of the biological sciences exhibiting a literature deluge (Figure 1). PHI-base (phi-base.org) is an open-access FAIR biological database containing data on bacterial, fungal, and protist genes proven to affect (or not to affect) the outcome of pathogen–host interactions (Rodríguez-Iglesias et al., 2016; Urban et al., 2020; Urban et al., 2022). Since 2005, PHI-base has manually curated phenotype data associated with underlying genome-level changes from peer-reviewed pathogen–host interaction literature. Information is also provided on the target sites of some anti-infective chemistries (Urban et al., 2020). Knowledge related to pathogen–host interaction phenotypes is increasingly relevant, as infectious microbes continually threaten global food security, human health across the life course, farmed animal health and wellbeing, tree health, and ecosystem resilience (Brown et al., 2012; Fisher et al., 2018; Fisher et al., 2012; Fisher et al., 2022; Smith et al., 2019). Rising resistance to antimicrobial compounds, increased globalization, and climate change indicate that infectious microbes will present ever-greater economic and societal threats (Bebber et al., 2013; Chaloner et al., 2021; Cook et al., 2021). In order to curate relevant publications into PHI-base (version 4), professional curators have, since 2011, entered 81 different data types into a text file (Urban et al., 2017). However, increasing publication numbers and data complexity required more robust curation procedures and greater involvement from publication authors.
We were unable to locate any curation frameworks or tools capable of capturing the interspecies interactions required for PHI-base. PomBase, the fission yeast (Schizosaccharomyces pombe) database developed Canto, a web-based tool supporting curation by both professional biocurators and publication authors (Rutherford et al., 2014). Canto already had support for annotating genes from multiple species in the same curation session, but it could not support annotation of the interactions between species, nor the annotation of genes from naturally occurring strains. We extended and customized Canto to support the annotation of multiple strains of multiple species, and the modeling and annotation of interspecies interactions between pathogens and hosts, to create a new tool: PHI-Canto (the Pathogen–Host Interaction Community Annotation Tool). Likewise, there were no existing biomedical ontologies that could accurately describe pathogen–host interaction phenotypes at the depth and breadth required for PHI-base. Infectious disease formation depends on a series of complex and dynamic interactions between pathogenic species and their potential hosts, and also requires the correct biotic and/or abiotic environmental conditions (Scholthof, 2007), as illustrated by the concept of the ‘disease triangle’ (Figure 2). All these interrelated factors must be recorded in order to sufficiently describe a pathogen–host interaction.
In this study, three key issues were addressed in order to develop the curation framework for interspecies interactions: first, to support the classification of genes as ‘pathogen’ or ‘host,’ and enable the variations of the same gene in different strains to be captured; second, formulating the concept of a ‘metagenotype’ to represent the interaction between specific strains of both a pathogen and a host within a multispecies genotype; and thirdly, developing supporting ontologies and controlled vocabularies, including the generic Pathogen–Host Interaction Phenotype Ontology (PHIPO), to annotate phenotypes connected to genotypes at the level of a single species (pathogen or host) and multiple species (pathogen–host interaction phenotypes). Leading on from these advances, we discuss how the overall curation framework described herein, the concept of annotating metagenotypes, and ongoing generic ontology development, is a suitable approach for adoption and use by a wide range of research communities in the life sciences focused on different types of interspecies interactions occurring within or across kingdoms in different environments and at multiple (micro to macro) scales.
Results
Enabling multispecies curation with UniProtKB accessions
In any curation context, stable identifiers are required for annotated entities. The UniProt Knowledgebase (UniProtKB) (Bateman et al., 2021) is universally recognized, provides broad taxonomic protein coverage, and manually curates standard nomenclature across protein families. Protein sequences are both manually and computationally annotated in UniProtKB, providing a wealth of data on catalytic activities, protein structures, and protein–protein interactions, Gene Ontology (GO) annotations, and links to PHI-base phenotypes (Ashburner et al., 2000; Carbon et al., 2021; Urban et al., 2022). To improve interoperability with other resources, we used UniProtKB accession numbers for retrieving protein entities, gene names, and species information for display in PHI-Canto. PHI-Canto accesses the UniProtKB API to automatically retrieve the entities and their associated data.
Developing the metagenotype to capture interspecies interactions
To enable the annotation of interspecies interactions, we developed the concept of a ‘metagenotype,’ which represents the combination of a pathogen genotype and a host genotype (Figure 3). A metagenotype is created after the individual genotypes from both species are created. Each metagenotype can be annotated with pathogen–host interaction phenotypes to capture changes in pathogenicity (caused by alterations to the pathogen) and changes in virulence (caused by alterations to the host and/or the pathogen). Pathogenicity is a property of the pathogen that describes the ability of the pathogen to cause an infectious disease in another organism. When a pathogenic organism causes disease, the severity of the disease that occurs is referred to as ‘virulence’ and this can also be dependent upon the host organism. Metagenotypes must always include at least one named pathogen gene with a genotype of interest, but need not include a host gene if none is referenced in a given experiment: instead, the wild-type host species and strain may be used for the host part of the metagenotype.
Annotation types and annotation extensions in PHI-Canto
In PHI-Canto, ‘annotation’ is the task of relating a specific piece of knowledge to a biological feature. Three types of biological features can be annotated in PHI-Canto: genes, genotypes, and metagenotypes. Genotypes can be further specified as pathogen genotypes or host genotypes. Each of these biological features has a corresponding set of annotation types. The relation between biological features, annotation types, and the values that can be used for annotation are shown in Table 1. To capture additional biologically relevant information associated with an annotation, curators use the concept of annotation extensions (which include Gene Ontology annotations described by Huntley et al., 2014) to extend the primary annotation. For Canto and PHI-Canto, the meaning of ‘annotation extension’ was broadened to capture additional properties related to the annotation, such as the metagenotype used as an experimental control. The aforementioned additional properties are simply referred to as ‘annotation extensions (AEs)’ in this study (Table 1, Supplementary file 1 and Supplementary file 2). Descriptions of the new AEs for PHI-Canto and the core collection of AEs from Canto are available in the PHI-Canto user documentation (see the Code availability section).
Metagenotypes can be annotated with terms from an ontology or controlled vocabulary following either the ‘pathogen–host interaction phenotype,’ ‘gene-for-gene phenotype,’ or ‘disease name’ annotation types (Table 1). Phenotype annotations on metagenotypes can be supported by AEs providing additional qualifying information required to fully interpret the experiment, such as the infected tissue of the host.
Phenotypes can also be curated for single-species experiments involving either the pathogen or host, following the ‘single species phenotype’ annotation workflow (Table 1). Single species phenotype annotations have a selection of AEs available, including the protein assayed in the experiment and the severity of the observed phenotype (see an example from PMID:22314539 in Appendix 1).
PHI-Canto also supports the annotation of gene and gene product attributes to represent the evolved functional role of a gene product, described here as the ‘gene annotation’ workflow (Table 1). The Gene Ontology is used for the annotation of a gene product’s molecular functions, biological processes, and cellular components, while PSI-MOD is used for the annotation of protein modifications (Montecchi-Palazzi et al., 2008), and BioGRID experiment types are used to capture genetic and physical interactions (Oughtred et al., 2021). GO annotations are submitted to the EBI GO Annotation Database (GOA), from where they are propagated to the main GO knowledge base (Carbon et al., 2021; Huntley et al., 2015).
Trial curation of interspecies interaction publications
Ten publications covering a wide range of typical plant, human, and animal pathogen–host interactions were selected for trial curation in PHI-Canto before the tool was made available to publication authors and communities to add further publications (Table 2). These publications included experiments with early-acting pathogen virulence proteins, the first host targets of pathogen effectors, and resistance to antifungal chemistries. These publications guided the development of the ontology terms and controlled vocabulary terms that were required for PHI-Canto, as well as the curation methods required for different experiments. Major curation problems and their solutions are summarized in Table 3, and example annotations are described below and in Appendix 1 and Appendix 2.
Curating an experiment with a metagenotype
A large proportion of the curation in PHI-Canto requires the use of metagenotypes: one of the simpler cases involves early-acting virulence proteins, where a genetically modified pathogen is inoculated onto a host (without a host gene being specified). A metagenotype is created to connect the genotypes of both species and is annotated with a phenotype term. These experiments are curated following the ‘pathogen–host interaction phenotype’ workflow, including any relevant AEs (Table 1). This two-step curation process is illustrated by PMID:29020037 curation (Table 2, Appendix 1 and Appendix 2) where the GT2 gene is deleted from the fungal plant pathogen Zymoseptoria tritici and inoculated onto wheat plants; the observed phenotype ‘absence of pathogen-associated host lesions’ (PHIPO:0000481) is annotated to the metagenotype; and the AE for ‘infective ability’ is annotated with ‘loss of pathogenicity’ compared to the unaltered pathogen.
Curating pathogen effector experiments
A pathogen effector is defined as an entity transferred between the pathogen and the host that is known or suspected to be responsible for either activating or suppressing a host process commonly involved in defense (Houterman et al., 2009; Jones and Dangl, 2006; Figure 2). To curate an effector experiment, a metagenotype is created and annotated with a phenotype term. To indicate that the pathogen gene functions as an effector, it is necessary to make a concurrent gene annotation (Table 1) with the GO biological process term ‘effector-mediated modulation of host process’ (GO:0140418) or an appropriate descendant term. This GO term (GO:0140418) and its descendant terms were created in collaboration with the Gene Ontology Consortium (GOC) and are used for pathogen effectors in PHI-base (version 5) (Supplementary file 3). Reported activities of pathogen effectors can also be curated with GO molecular function terms. An example of curation of a pathogen effector experiment is illustrated using PMID:31804478 (Table 2 and Appendix 1) where the pathogen effector Pst_12806 from Puccinia striiformis suppresses pattern-triggered immunity in a tobacco leaf model. Here, the metagenotype is annotated with the phenotype ‘decreased level of host defense-induced callose deposition’ (PHIPO:0001015) and the effector is annotated with ‘effector-mediated suppression of host pattern-triggered immunity’ (GO:0052034). A further experiment demonstrated that the pathogen effector protein was able to bind to the natural host (wheat) protein PetC and inhibit its enzyme activity, resulting in a GO molecular function annotation ‘enzyme inhibitor activity’ (GO:0004857) for Pst_12806, with PetC captured as the target protein (see Appendix 1).
Curating experiments with a gene-for-gene relationship
For a gene-for-gene pathogen–host interaction type, the ‘gene-for-gene phenotype’ metagenotype workflow is followed (a gene-for-gene interaction is when a known genetic interaction is conferred by a specific pathogen avirulence gene product and its cognate host resistance gene product) (Figure 2c and d, further described in the figure legend Flor, 1956; Jones and Dangl, 2006; Kanyuka et al., 2022). The metagenotypes and phenotype annotations are made in the same way as the standard ‘pathogen–host interaction phenotype’ workflow, but with different supporting data. A new AE was created to indicate the following three components of the interaction: (i) the compatibility of the interaction, (ii) the functional status of the pathogen gene, and (iii) the functional status of the host gene. An example of an annotation for a biotrophic pathogen gene-for-gene interaction has been illustrated with PMID:20601497 (Table 2 and Appendix 1). Inverse gene-for-gene relationships occur with necrotrophic pathogens, where the pathogen necrotrophic effector interacts with a gene product from the corresponding host susceptibility locus and activates a host response that benefits the pathogen (a compatible interaction). If the necrotrophic effector cannot interact with the host target, then no disease occurs (an incompatible interaction) (Breen et al., 2016). An example of an inverse gene-for-gene interaction using the appropriate AEs is illustrated with PMID:22241993 (Table 2 and Appendix 1).
Curating an experiment with a single species genotype in the presence or absence of a chemical
Single species genotypes (pathogen or host) can also be annotated with phenotypes following the ‘single species phenotype’ workflow (Table 1). This is illustrated using PMID:22314539 in Table 2 (and Appendix 1) with an example of an in vitro pathogen chemistry phenotype, where a single nucleotide mutation in the Aspergillus flavus cyp51C gene confers ‘resistance to voriconazole’ (PHIPO:0000590), an antifungal agent.
Supporting curation of legacy information
PHI-Canto’s curation workflows maintain support for nine high-level terms that describe phenotypic outcomes essential for taxonomically diverse interspecies comparisons, which were the primary annotation method used in previous versions of PHI-base (Urban et al., 2015) and which are displayed in the Ensembl Genomes browser (Yates et al., 2022). For example, the ‘infective ability’ AE can be used to annotate the following subset of high-level terms: ‘loss of pathogenicity,’ ‘unaffected pathogenicity,’ ‘reduced virulence,’ ‘increased virulence,’ and ‘loss of mutualism’ (formerly ‘enhanced antagonism’). The mapping between the nine high-level terms and the PHI-Canto curation process is further described in Supplementary file 3.
Resolving additional problems with curating complex pathogen–host interactions
Table 3 shows a selection of the problems encountered during the development of PHI-Canto and the solutions we identified: for example, recording the delivery mechanism used within the pathogen–host interaction experiment. New experimental condition terms were developed with a prefix of ‘delivery mechanism’: for example, ‘delivery mechanism: agrobacterium,’ ‘delivery mechanism: heterologous organism,’ and ‘delivery mechanism: pathogen inoculation.’ Another issue encountered was how to record a physical interaction between two proteins from different species, especially for the curation of pathogen effectors and their discovered first host targets. This was resolved by adapting the existing Canto module for curating physical interactions to support two different species.
Development of the Pathogen–Host Interaction Phenotype Ontology and additional data lists
To support the annotation of phenotypes in PHI-Canto, PHIPO was developed. PHIPO is a species-neutral phenotype ontology that describes a broad range of pathogen–host interaction phenotypes. Terms in PHIPO were developed following a pre-compositional approach, where the term names and semantics were composed from existing terms from other ontologies, in order to make the curation process easier. For example, the curator annotates 'resistance to penicillin' (PHIPO:0000692) instead of annotating ‘increased resistance to chemical’ (PHIPO:0000022) and ‘penicillin’ (CHEBI:17334) separately. Terms in PHIPO have logical definitions that follow design patterns from the uPheno ontology (Shefchek et al., 2020), and mapping PHIPO terms to the uPheno patterns is an ongoing effort. These logical definitions provide relations between phenotypes in PHIPO and terms in other ontologies, such as PATO, GO, and ChEBI. PHIPO is available in OWL and OBO formats from the OBO Foundry (Jackson et al., 2021).
PHI-Canto uses additional controlled vocabularies derived from data in PHI-base. To enable PHI-Canto to distinguish between pathogen and host organisms, we extracted a list of >250 pathogen species and >200 host species from PHI-base (Supplementary file 4). A curated list of strain names and their synonyms for the species currently curated in PHI-base was also developed for use in PHI-Canto (Supplementary files 4 and 5). PHI-base uses ‘strain’ as a grouping term for natural pathogen isolates, host cultivars, and landraces, all of which are included in the curated list. The curation of pathogen strain designations was motivated by the NCBI Taxonomy’s decision to discontinue the assignment of strain-level taxonomic identifiers (Federhen et al., 2014) and a lack of standardized nomenclature for natural isolates of non-model species. New strain designations can be requested by curators and are reviewed by an expert prior to inclusion to ensure that each describes a novel strain designation rather than a new synonym for an existing strain.
Annotations in PHI-Canto include experimental evidence, which is specified by a term from a subset of the Evidence & Conclusion Ontology (ECO) (Giglio et al., 2019). Experimental evidence codes specific to pathogen–host interaction experiments have been developed and submitted to ECO. Phenotype annotations also include experimental conditions that are relevant to the experiment being curated, which are sourced from the PHI-base Experimental Conditions Ontology (PHI-ECO).
PHI-Canto includes a ‘disease name’ annotation type (Table 1) for annotating the name of the disease caused by an interaction between the pathogen and host specified in a wild-type metagenotype (this annotation type is described in the PHI-Canto user documentation and in Appendix 2). Diseases are specified by a controlled vocabulary of disease names (called PHIDO), which was derived from disease names curated in previous versions of PHI-base (Urban et al., 2022). PHIDO was developed as a placeholder to allow disease names to be annotated on a wide variety of pathogen interactions, including those on plant, human, animal, and invertebrate hosts, especially where such diseases were not described in any existing ontology.
Summary of the PHI-Canto curation process
The PHI-Canto curation process is outlined in Figure 4, Figure 4—figure supplement 1, the PHI-Canto user documentation, a detailed worked example provided in Appendix 2 and curation tutorials on the PHI-base YouTube channel (https://www.youtube.com/@PHI-base), under the playlist ‘PHI-Canto tutorial videos.’ Each curation session is associated with one publication (using its PubMed identifier). One or more curators can collaborate on curating the same publication. An instructional email is sent by PHI-Canto to curators when they begin a new curation session, and PHI-base provides further guidelines on what information is needed to curate a publication in PHI-Canto (Figure 4—figure supplement 2) and how to identify UniProtKB accession numbers from reference proteomes (Figure 4—figure supplement 3).
The curator first adds genes from the publication, then creates alleles from genes, genotypes from alleles, and metagenotypes from pathogen and host genotypes. Pathogen genotypes and host genotypes are created on separate pages, that only include genes from the relevant pathogen or host. A genotype can consist of multiple alleles, therefore, a metagenotype can contain multiple alleles from both the pathogen and the host. A ‘copy and edit’ feature allows the creation of multiple similar annotations.
To make annotations, the curator selects a gene, genotype, or metagenotype to annotate, then selects a term from a controlled vocabulary, adds experimental evidence, experimental conditions, AEs (where available), and any additional comments. The curator can also specify a figure or table number from the original publication as part of the annotation. Curators can use a term suggestion feature to suggest new terms for any controlled vocabulary used by PHI-Canto, and experimental conditions can be entered as free text if no suitable condition is found in PHI-ECO. Subsequently, new condition suggestions are reviewed and approved by expert curators. The curation session can be saved and paused at various stages during the curation process. Once the curation process is complete, the curator submits the session for review by a nominated species expert.
Display and interoperability of data
The process of incorporating FAIR principles fully into the PHI-base curation process will promote interoperability between data resources (Wilkinson et al., 2016). Figure 5 illustrates the internal and external resource dependencies for curation in PHI-Canto. URLs and descriptions of the use of each resource are provided in Figure 5—figure supplement 1. All data curated in PHI-Canto will be displayed in the new gene-centric version 5 of PHI-base, introduced in Urban et al., 2022. Additional detail on the data types displayed in PHI-base 5 is available in Table 4. Reciprocally, components of the interspecies curation framework (Figure 6a) will provide data to other resources (Figure 6b). For example, GO terms will be used in curation with PHI-Canto and these annotations will be made available in the GO knowledge base via submission to the GOA Database (Carbon et al., 2021; Huntley et al., 2015). PHI-base is a member of ELIXIR, an organization that aims to unite leading life science resources and is a major proponent of FAIR data (Durinx et al., 2016).
Discussion
Scalable and accurate curation of data within the scientific literature is of paramount importance due to the increasing quantity of publications and the complexity of experiments within each publication. PHI-base is an example of a freely available, manually curated database, which has been curating literature using professional curators since 2005 (Winnenburg et al., 2006).
Here, we have described the development of PHI-Canto to allow the curation of the interspecies pathogen–host interaction literature by professional curators and publication authors. This curated data is then made available on the new gene-centric version 5 of PHI-base, where all information (i.e. new and existing) on a single gene from several publications is presented on a single page, with links to external resources providing information on interacting genes, proteins, and other entities.
Several adaptations to the original single-species community annotation tool, Canto (Rutherford et al., 2014), were required to convert this tool for interspecies use. Notably, the need to annotate an interaction involving two different organisms necessitated the development of a novel concept, the ‘metagenotype’ (Figure 3), in order to record a combined experimental genotype involving both a pathogen and a host. This is, to our knowledge, the first example of such an approach to interspecies interaction curation.
Curation of pathogen–host interactions in PHI-Canto also necessitated the development of a new phenotype ontology (PHIPO) to annotate pathogen–host interaction phenotypes in sufficient detail across the broad range of host species that were curated in PHI-base (n=234 in version 4.14 of PHI-base). The functional annotation of genes involved in interspecies interactions is a complex and challenging task, requiring ongoing modifications to the Gene Ontology and occasional major refactoring to deprecate legacy terms (Carbon et al., 2021). PHIPO development and maintenance will also be an ongoing task, with both authors and professional curators requesting new terms and edits to existing terms and the ontology structure. Maintenance will be made more sustainable by the incorporation of logical definitions that are aligned across phenotype ontologies in collaboration with the uPheno project (Shefchek et al., 2020).
To improve the efficiency of the curation process, we are suggesting that authors follow an author checklist during manuscript preparation (Appendix 3). This will improve the presentation of key information (e.g. species names, gene identifiers, etc.) in published manuscripts, thus enabling more efficient and comprehensive curation that is human- and machine-readable. The annotation procedures described here using PHI-Canto can be used to extract data buried in small-scale publications and increase the accessibility of the curated article to a wider range of potential users, for example, computational biologists, thereby improving the FAIR status of the data. The current data in PHI-base has been obtained from >200 journals (Figure 7) and, therefore, represents highly fragmented knowledge which is exceptionally difficult to use by professionals in other disciplines. The feasibility of scalable community curation with Canto is evidenced by PomBase (Lock et al., 2020), where Canto S. pombe annotations from over 1000 publications are provided by publication authors, with the data made available within 24 hr of review (https://curation.pombase.org/pombe/stats/annotation).
With regard to our focus on manual curation, we recognize that great progress has been made with machine learning (ML) approaches in recent times. However, Wood et al., 2022 note that the data being curated from publications are ‘categorical, highly complex, and with hundreds of thousands of heterogeneous classes, often not explicitly labeled.’ There are no published examples of ML approaches outperforming an expert curator in accuracy, which is paramount in the medical field. However, curation by experts could provide a highly reliable corpus that could be used for training ML systems. Our aspiration is that ML and expert curators can collaborate in a virtuous cycle whereby expert curators continually review and refine the ML models, while the manual work of finding publications and entity recognition is handled by the ML system.
Our future intentions are twofold: first, a graph-based representation of the data will be enabled by integration with knowledge network generation tools, such as Knetminer (Hassani-Pak et al., 2021), where subgraphs of the knowledge graph could be embedded into each gene-centric page on the PHI-base 5 website. Second, within PHI-Canto, we intend to address the issues associated with maximizing the inherent value of the natural sequence variation between species strains, and the associated altered phenotypic outcomes observed at multiple scales, in different types of interactions and/or environments. PHI-base already contains information on numerous species with multiple experimental strains, and natural sequence variation between strains can result in alterations at the genome level that affect the subsequently observed phenotypes. Strain-specific sequence variation is not captured in the reference proteomes stored by UniProt, even though accession numbers from these proteomes are often used in PHI-Canto. Currently, when a curator enters a gene with a taxonomic identifier below the species rank, PHI-Canto maps the identifier to the corresponding identifier at the species rank (thus removing any strain details from the organism name), and the curator specifies a strain to differentiate gene variants in naturally occurring strains. However, this does not change the taxonomic identifier linked to the UniProtKB accession number (nor its sequence), so the potential for inaccuracy remains. To mitigate this, the future plan is to record the strain-specific sequence of the gene using an accession number from a database from the International Nucleotide Sequence Database Collaboration (Arita et al., 2021).
The release of PHI-Canto to the community will occur gradually through various routes. Community curation will be promoted by working with journals to capture the publication data at the source, at the point of manuscript acceptance. We will also target specific research communities (e.g. those working on a particular pathogen and/or research topic) by inviting authors to curate their own publications. Authors may contact us directly to request support while curating their publications in PHI-Canto.
PHI-Canto, PHI-base, and PHIPO were devised and built over the past seven years to serve the research needs of a specific international research community interested in exploring the wide diversity of common and species-specific mechanisms underlying pathogen attack and host defense in plants, animals, humans, and other host organisms caused by fungi, protists and bacteria. However, it should be noted that the underlying developments to Canto’s data model – especially the concept of annotating metagenotypes – could be of use to communities focused on different types of interspecies interactions. Possible future uses of the PHI-Canto schema could include insect–plant interactions (both beneficial and detrimental), endosymbiotic relationships such as mycorrhiza–plant rhizosphere interactions, nodulating bacteria–plant rhizosphere interactions, fungi–fungi interactions, plant–plant interactions or bacteria–insect interactions, and non-pathogenic relationships in natural environments such as bulk soil, rhizosphere, phyllosphere, air, freshwater, estuarine water or seawater, and human–animal, animal–bird, human–insect, animal–insect, bird–insect interactions in various anatomical locations (e.g. gut, lung, and skin). The schema could also be extended to situations where phenotype–genotype relations have been established for predator–prey relationships or where there is competition in herbivore–herbivore, predator–predator or prey–prey relationships in the air, on land, or in the water. Finally, the schema could be used to explore strain-to-strain interactions within a species when different biological properties have been noted. Customizing Canto to use other ontologies and controlled vocabularies is as simple as editing a configuration file, as shown in Source code 1.
Methods
Changes to the Canto data model and configuration
PHI-Canto stores its data in a series of relational databases using the SQLite database engine. A primary database stores data shared across all curation sessions, and each curation session also has its own database to store data related to a single publication (such as genes, genotypes, metagenotypes, etc.). PHI-Canto can export its data as a JSON file or in more specialized formats, for example, the GO Annotation File (GAF) format.
To implement PHI-Canto several new entities were added to the Canto data model in order to support pathogen–host curation, as well as new configuration options (the new entities are illustrated in Figure 3—figure supplement 1). These entities were ‘strain,’ ‘metagenotype,’ and ‘metagenotype annotation.’ The complete data model for PHI-Canto is illustrated in Figure 3—figure supplements 2 and 3.
Pathogen and host roles
Genotype entities in PHI-Canto’s data model were extended with an attribute indicating their status as a pathogen genotype or a host genotype. Genotypes inherit their status (as pathogen or host) from the organism, which in turn is classified as a pathogen or host based on a configuration file that contains the NCBI Taxonomy ID (taxid) (Schoch et al., 2020) of each host species in PHI-base. Only host taxids need to be specified since PHI-Canto defaults to classifying a species as a pathogen if its taxid is not found in the configuration file.
PHI-Canto also loads lists of pathogen and host species that specify the scientific name, taxid, and common name (if any) of each species. These species lists are used to specify which host species can be added as a component of the metagenotype in the absence of a specific studied gene, and to override the scientific name provided by UniProtKB in favor of the name used by a scientific community studying the species (for example, to control whether the anamorph or teleomorph name of a fungal species is displayed in PHI-Canto’s user interface).
Metagenotype implementation
Metagenotypes were implemented by adding a ‘metagenotype’ entity to PHI-Canto’s data model. The metagenotype is the composition of two genotype entities. We also changed the data model to allow annotations to be related to metagenotypes (previously, only genes and genotypes could be related to annotations).
Strain implementation
Support for strain curation was implemented by adding a ‘strain’ entity to PHI-Canto’s data model. Strains are related to an organism entity and its related genotype entities. In the user interface, PHI-Canto uses the taxid of the organism to filter an autocomplete system, such that only the strains of the specified organism are suggested. The autocomplete system can also use synonyms in the strain list to suggest a strain based on its synonymous names. Unknown strains are represented by a preset value of ‘Unknown strain.’
Ontologies
PHIPO was developed using the Protégé ontology editor (Musen and Team, 2015). PHIPO uses OBO namespaces to allow PHI-Canto to filter the terms in the ontology by annotation type, ensuring that genotypes are annotated with single-species phenotypes and metagenotypes with pathogen–host interaction phenotypes.
PHI-ECO was also developed using Protégé, starting from a list of experimental conditions originally developed by PomBase. PHIDO was initially derived from a list of diseases already curated in PHI-base and is now maintained as a flat file that is converted into an OBO file using ROBOT (Jackson et al., 2019).
Data availability
Pathogen–Host Interaction Phenotype Ontology: http://purl.obolibrary.org/obo/phipo.owl.
PHI-base: Experimental Conditions Ontology: (Cuzick and Seager, 2022a) https://github.com/PHI-base/phi-eco.
PHIDO: the controlled vocabulary of disease names: (Cuzick and Seager, 2022b) https://github.com/PHI-base/phido.
PHIPO Extension Ontology for gene-for-gene phenotypes: (Cuzick and Seager, 2022c) https://github.com/PHI-base/phipo_ext.
Location of species and strain lists used by PHI-Canto: (Cuzick et al., 2022d) https://github.com/PHI-base/data.
PHI-Canto approved curation sessions (December 2022): https://doi.org/10.5281/zenodo.7428788.
Code availability
PHI-Canto’s source code is available on GitHub, at https://github.com/PHI-base/canto, (copy archived at swh:1:rev:dd310334974d9471c1916c0ac080550bfd153707). PHI-Canto is freely licensed under the GNU General Public License version 3, with no restrictions on copying, distributing, or modifying the code, for commercial use or otherwise, provided any derivative works are licensed under the same terms. PHI-base provides an online demo version of PHI-Canto at https://demo-canto.phi-base.org/ which can be used for evaluating the tool. The demo version and the main version of PHI-Canto will remain freely available online.
Canto’s source code is available on GitHub, at https://github.com/pombase/canto, (copy archived at swh:1:rev:2f8fe11c217b52a69251cb589abdf798dab3767b). Canto is also freely licensed under the GNU General Public License version 3.
The source code for PHI-Canto’s user documentation is available on GitHub, at https://github.com/PHI-base/canto-docs, (copy archived at swh:1:rev:a134c04d8fb59769678456fb41d02fd169be7b06). The user documentation is licensed under the MIT license. The published format of the user documentation is available online at https://canto.phi-base.org/docs/index.
The source code for PHIPO is available on GitHub under a Creative Commons Attribution 3.0 license, at https://github.com/PHI-base/phipo, (copy archived at swh:1:rev:fbb0af482869744e085e829c463d4eb0c6afafd2).
Appendix 1
How to use annotation extensions
This file provides information on Annotation Extensions (AE) and how to use them in PHI-Canto to curate a standard selection of experiments (Table 2). The first section provides four examples of using AEs for curating metagenotypes with pathogen-host interaction phenotypes. The second section provides examples of curating metagenotypes using the gene-for-gene phenotype workflow, including using the AEs for gene-for-gene interactions and inverse gene-for-gene interactions. The third section of this file illustrates three examples of using AEs for curating single-species phenotypes.
Further information on how to use PHI-Canto to make annotations can be found in PHI-Canto’s user documentation, available at https://canto.phi-base.org/docs/index.
Contents:
SECTION 1: Annotation Extensions for curating pathogen-host interaction phenotypes on metagenotypes
Section 1A: If you have a metagenotype phenotype recording ‘unaffected pathogenicity’ (corresponds to footnote ‡ in Table 2)
Section 1B: If you have a metagenotype phenotype recording ‘altered pathogenicity or virulence’ (corresponds to footnote § in Table 2)
Section 1C: If you have a metagenotype phenotype recording ‘mutualism’ (corresponds to footnote ** in Table 2)
Section 1D: If you have a metagenotype phenotype recording ‘a pathogen effector’ (corresponds to footnote †† in Table 2)
SECTION 2: Annotation Extensions for curating gene-for-gene phenotypes on metagenotypes
Section 2A: If you have a metagenotype phenotype recording ‘a gene-for-gene interaction’ (corresponds to footnote ‡ ‡ in Table 2)
Section 2B: If you have a metagenotype phenotype recording ‘an inverse gene-for-gene interaction’ (corresponds to footnote ¶ ¶ in Table 2)
SECTION 3: Annotation Extensions for curating single species phenotypes (pathogen phenotypes or host phenotypes)
Section 3A: Example of an in vitro pathogen phenotype (corresponds to footnote ¶ in Table 2)
Section 3B: Example of an in vitro pathogen chemistry phenotype (corresponds to footnote *** in Table 2)
Section 3C: Example of an in vivo host phenotype (corresponds to footnote § § in Table 2)
Section 1: Annotation Extensions for curating pathogen-host interaction phenotypes on metagenotypes
When creating and annotating metagenotypes, it is advisable to also create and annotate a wild-type control metagenotype where possible. This enables a better understanding of annotations made to altered metagenotypes.
(Note: It is also possible to use several of the AEs in the table documenting single species phenotype AEs, e.g. penetrance and affected protein).
Section 1 A: If you have a metagenotype phenotype recording ‘unaffected pathogenicity’ (corresponds to footnote ‡ in Table 2)
Example publication: The RhlR quorum-sensing receptor controls Pseudomonas aeruginosa pathogenesis and biofilm development independently of its canonical homoserine lactone autoinducer (PMID:28715477).
Section 1B: If you have a metagenotype phenotype recording ‘altered pathogenicity or virulence’ (corresponds to footnote § in Table 2)
Example publication: A conserved fungal glycosyltransferase facilitates pathogenesis of plants by enabling hyphal growth on solid surfaces (PMID:29020037).
A training video is available for the curation of this publication at https://youtu.be/44XGoi6Ijqk?t=1738.
Section 1 C: If you have a metagenotype phenotype recording ‘mutualism’ (corresponds to footnote ** in Table 2)
Example publication: Reactive oxygen species play a role in regulating a fungus-perennial ryegrass mutualistic interaction (PMID:16517760).
Section 1D: If you have a metagenotype phenotype recording ‘a pathogen effector’ (corresponds to footnote †† in Table 2)
If you have a biotrophic or necrotrophic plant pathogen effector which is involved in a gene-for-gene interaction, please see the AEs for the ‘gene-for-gene interaction’ or ‘inverse gene-for-gene interaction’ workflow (Section 2).
Annotate the pathogen effector with the GO Biological Process term ‘effector-mediated modulation of host process by symbiont’ (GO:0140418) or a descendant. If the GO Molecular Function term is known, then this can also be annotated and linked to the relevant GO effector term via an annotation extension.
Example publication: An effector protein of the wheat stripe rust fungus targets chloroplasts and suppresses chloroplast function (PMID:31804478).
Altered metagenotype
Section 2: Annotation extensions for curating gene-for-gene phenotypes on metagenotypes
Section 2 A: If you have a metagenotype phenotype recording ‘a gene-for-gene interaction’ (corresponds to footnote ‡ ‡ in Table 2)
Annotate the pathogen effector with the GO Biological process term ‘effector-mediated modulation of host process by symbiont’ (GO:0140418) or a descendant. If the GO Molecular Function term is known, then this can also be annotated and linked to the relevant GO effector term via an annotation extension.
Example publication: Activation of an Arabidopsis resistance protein is specified by the in planta association of its leucine-rich repeat domain with the cognate oomycete effector (PMID:20601497).
Section 2B: If you have a metagenotype phenotype recording ‘an inverse gene-for-gene interaction’ (corresponds to footnote ¶ ¶ in Table 2)
Annotate the pathogen effector with the GO Biological process term ‘effector-mediated modulation of host process by symbiont’ (GO:0140418) or a descendant. If the GO Molecular Function term is known, then this can also be annotated and linked to the relevant GO effector term via an annotation extension.
Example publication: The cysteine-rich necrotrophic effector SnTox1 produced by Stagonospora nodorum triggers susceptibility of wheat lines harboring Snn1 (PMID:22241993).
Section 3: Annotation extensions for curating single species phenotypes (pathogen phenotypes or host phenotypes)
Section 3 A: Example of an in vitro pathogen phenotype (corresponds to footnote ¶ in Table 2)
Example publication: A conserved fungal glycosyltransferase facilitates pathogenesis of plants by enabling hyphal growth on solid surfaces (PMID:29020037).
A training video is available for the curation of this publication at https://youtu.be/44XGoi6Ijqk?t=1738.
Section 3B: Example of an in vitro pathogen chemistry phenotype (corresponds to footnote *** in Table 2)
Example publication: The T788G mutation in the cyp51C gene confers voriconazole resistance in Aspergillus flavus causing aspergillosis. (PMID:22314539).
Section 3 C: Example of an in vivo host phenotype (corresponds to footnote § § in Table 2)
Example publication: Activation of an Arabidopsis resistance protein is specified by the in planta association of its leucine-rich repeat domain with the cognate oomycete effector. (PMID:20601497).
Appendix 2
Worked example of a curation session
This document provides a worked example of the curation process in PHI-Canto for the publication by King et al., 2017, A conserved fungal glycosyltransferase facilitates pathogenesis of plants by enabling hyphal growth on solid surfaces (PMID:29020037).
The research study confirms the hypothesis that the GT2 gene is required for the fungal pathogens Zymoseptoria tritici and Fusarium graminearum to cause disease in wheat (Triticum aestivum). The curation session in PHI-Canto captures this conclusion by annotating a pathogen–host interaction between Z. tritici and T. aestivum to show that deletion of the GT2 gene causes loss of pathogenicity in the pathogen, and an absence of pathogen-associated lesions in the host. The wild-type interaction between Z. tritici and T. aestivum is annotated to indicate the presence of disease (and lesions), and a corresponding pathogen–host interaction between F. graminearum and T. aestivum is annotated to show that deleting GT2 again causes a loss of pathogenicity and the absence of pathogen-associated lesions in the host.
The example starts with the entry of the publication into PHI-Canto (https://canto.phi-base.org/) and ends with the submission of the curation session for review by curators at PHI-base. The information curated from this publication is available on the new gene-centric PHI-base 5 website (http://phi5.phi-base.org, search for PHIG:308 and PHIG:307).
Entering the publication
Specifying genes and species
Specifying strains
Creating alleles and genotypes
Creating metagenotypes for pathogen–host interactions
Annotating pathogen–host interactions with phenotypes
Phenotype and evidence
Annotation extensions
Figure numbers and comments
Copying annotations
Disease annotation
Gene Ontology annotation
Other annotation types
The publication contains other information which is not included in this worked example for the sake of brevity. In the real curation session, this other information is captured as the following annotations:
GO biological process annotations indicate that GT2 is involved in the hyphal growth process.
GO cellular component annotations indicate that GT2 is located in the hyphal cell wall.
Pathogen phenotype annotations capture information about the pathogen in vitro, specifically normal and altered phenotypes for unicellular population growth, hyphal growth, cellular melanin accumulation, filament morphology, and so on.
All these annotation types use the same annotation process as the annotation types described above.
Submitting the curation session
Appendix 3
Author checklist prior to publication
Here, we have developed a list of important points for an author to consider prior to submitting a manuscript for publication. Nine key points are displayed in Appendix 3—table 1.
Data availability
Datasets generated for use within the curation framework are available as GitHub links in the manuscript section 'Data availability'. Code is available as GitHub links in the manuscript section 'Code availability'. PHI-Canto curated data is available here https://doi.org/10.5281/zenodo.7428788.
References
-
Alliance of genome resources portal: unified model organism research platformNucleic Acids Research 48:D650–D658.https://doi.org/10.1093/nar/gkz813
-
The international nucleotide sequence database collaborationNucleic Acids Research 49:D121–D124.https://doi.org/10.1093/nar/gkaa967
-
Gene Ontology: tool for the unification of biologyNature Genetics 25:25–29.https://doi.org/10.1038/75556
-
Uniprot: the universal protein knowledgebase in 2021Nucleic Acids Research 49:D480–D489.https://doi.org/10.1093/nar/gkaa1100
-
Crop pests and pathogens move polewards in a warming worldNature Climate Change 3:985–988.https://doi.org/10.1038/nclimate1990
-
Hidden killers: human fungal infectionsScience Translational Medicine 4:165rv13.https://doi.org/10.1126/scitranslmed.3004404
-
The Gene Ontology resource: enriching a gold mineNucleic Acids Research 49:D325–D334.https://doi.org/10.1093/nar/gkaa1113
-
Plant pathogen infection risk tracks global crop yields under climate changeNature Climate Change 11:710–715.https://doi.org/10.1038/s41558-021-01104-8
-
Identifying ELIXIR core data resourcesF1000Research 5:ELIXIR-2422.https://doi.org/10.12688/f1000research.9656.2
-
Tackling the emerging threat of antifungal resistance to human healthNature Reviews. Microbiology 20:557–571.https://doi.org/10.1038/s41579-022-00720-1
-
BookThe complementary genic systems in flax and flax rustIn: Demerec M, editors. Advances in Genetics. Academic Press. pp. 29–54.
-
ECO, the Evidence & Conclusion Ontology: community standard for evidence informationNucleic Acids Research 47:D1186–D1194.https://doi.org/10.1093/nar/gky1036
-
Knetminer: a comprehensive approach for supporting evidence-based gene discovery and complex trait analysis across speciesPlant Biotechnology Journal 19:1670–1678.https://doi.org/10.1111/pbi.13583
-
The GOA database: Gene Ontology annotation updates for 2015Nucleic Acids Research 43:D1057–D1063.https://doi.org/10.1093/nar/gku1113
-
Biocuration: distilling data into knowledgePLOS Biology 16:e2002846.https://doi.org/10.1371/journal.pbio.2002846
-
ROBOT: a tool for automating ontology workflowsBMC Bioinformatics 20:407.https://doi.org/10.1186/s12859-019-3002-3
-
The PSI-MOD community standard for representation of protein modification dataNature Biotechnology 26:864–866.https://doi.org/10.1038/nbt0808-864
-
Publishing FAIR data: an exemplar methodology utilizing PHI-baseFrontiers in Plant Science 7:641.https://doi.org/10.3389/fpls.2016.00641
-
Canto: an online tool for community literature curationBioinformatics 30:1791–1792.https://doi.org/10.1093/bioinformatics/btu103
-
The disease triangle: pathogens, the environment and societyNature Reviews. Microbiology 5:152–156.https://doi.org/10.1038/nrmicro1596
-
The pathogen-host interactions database (PHI-base): additions and future developmentsNucleic Acids Research 43:D645–D655.https://doi.org/10.1093/nar/gku1165
-
PHI-base: a new interface and further additions for the multi-species pathogen-host interactions databaseNucleic Acids Research 45:D604–D610.https://doi.org/10.1093/nar/gkw1089
-
PHI-base: the pathogen-host interactions databaseNucleic Acids Research 48:D613–D620.https://doi.org/10.1093/nar/gkz904
-
PHI-base in 2022: a multi-species phenotype database for pathogen-host interactionsNucleic Acids Research 50:D837–D847.https://doi.org/10.1093/nar/gkab1037
-
PHI-base: a new database for pathogen host interactionsNucleic Acids Research 34:D459–D464.https://doi.org/10.1093/nar/gkj047
-
Ensembl Genomes 2022: an expanding genome resource for non-vertebratesNucleic Acids Research 50:D996–D1003.https://doi.org/10.1093/nar/gkab1007
Article and author information
Author details
Funding
Biotechnology and Biological Sciences Research Council (BB/S020020/1)
- Alayne Cuzick
- Martin Urban
- Kim E Hammond-Kosack
- James Seager
Biotechnology and Biological Sciences Research Council (BB/S020098/1)
- Alayne Cuzick
- Martin Urban
- Kim E Hammond-Kosack
Biotechnology and Biological Sciences Research Council (BB/X011003/1)
- Alayne Cuzick
- Martin Urban
- Kim E Hammond-Kosack
Biotechnology and Biological Sciences Research Council (BB/X010953/1)
- Alayne Cuzick
- James Seager
Biotechnology and Biological Sciences Research Council (BB/P016855/1)
- Martin Urban
- Kim E Hammond-Kosack
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank the late post-doctoral PHI-base team member Dr. Alistair Irvine for adding chemical entries to ChEBI. Dr. Paul Kersey, formerly the non-vertebrate Ensembl team leader, is thanked for helpful discussions and ideas on community engagement. We thank Dr. Midori Harris (formerly of the University of Cambridge, UK) for providing valuable input into the development of PHIPO based on her extensive knowledge of FYPO. Dr. Pascale Gaudet (Swiss-Prot, Swiss Institute of Bioinformatics) is thanked for the generation and editing of GO terms involved in interspecies interactions. We also thank Drs. Chris Stephens and Ana Machado-Wood (both formerly of Rothamsted Research) for completing the trial curation of articles into beta versions of PHI-Canto and providing invaluable feedback and suggestions for further improvements. We thank Dr. Melina Velasquez (based at Rothamsted Research) for preparing the PHI-Canto tutorial videos. The Molecular Connections team based in Bangalore India while developing the PHI-base 5 website, provided useful feedback on data interoperability between PHI-Canto and the new gene-centric version of PHI-base. PHI-base is funded by the UK Biotechnology and Biological Sciences Research Council (BBSRC) Grants BB/S020020/1 and BB/S020098/1 and previously by the BBSRC National Capability Grant (2012–2017). Rothamsted authors MU and KHK receive additional BBSRC grant-aided support as part of the Institute Strategic Programme (ISP) Designing Future Wheat Grant (BB/P016855/1) and Delivering Sustainable Wheat (DSW) (BB/X011003/1). In addition, author AC receives BBSRC ISP DSW (BB/X011003/1) support and authors AC and JS receive BBSRC ISP Growing Health (BB/X010953/1) support. This work was conducted using the Protégé resource, which is supported by grant GM10331601 from the National Institute of General Medical Sciences of the United States National Institutes of Health.
Copyright
© 2023, Cuzick et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
-
- 492
- views
-
- 55
- downloads
-
- 4
- citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading
-
- Ecology
- Microbiology and Infectious Disease
Interspecies interactions involving direct competition via bacteriocin production play a vital role in shaping ecological dynamics within microbial ecosystems. For instance, the ribosomally produced siderophore bacteriocins, known as class IIb microcins, affect the colonization of host-associated pathogenic Enterobacteriaceae species. Notably, to date, only five of these antimicrobials have been identified, all derived from specific Escherichia coli and Klebsiella pneumoniae strains. We hypothesized that class IIb microcin production extends beyond these specific compounds and organisms. With a customized informatics-driven approach, screening bacterial genomes in public databases with BLAST and manual curation, we have discovered 12 previously unknown class IIb microcins in seven additional Enterobacteriaceae species, encompassing phytopathogens and environmental isolates. We introduce three novel clades of microcins (MccW, MccX, and MccZ), while also identifying eight new variants of the five known class IIb microcins. To validate their antimicrobial potential, we heterologously expressed these microcins in E. coli and demonstrated efficacy against a variety of bacterial isolates, including plant pathogens from the genera Brenneria, Gibbsiella, and Rahnella. Two newly discovered microcins exhibit activity against Gram-negative ESKAPE pathogens, i.e., Acinetobacter baumannii or Pseudomonas aeruginosa, providing the first evidence that class IIb microcins can target bacteria outside of the Enterobacteriaceae family. This study underscores that class IIb microcin genes are more prevalent in the microbial world than previously recognized and that synthetic hybrid microcins can be a viable tool to target clinically relevant drug-resistant pathogens. Our findings hold significant promise for the development of innovative engineered live biotherapeutic products tailored to combat these resilient bacteria.
-
- Computational and Systems Biology
- Microbiology and Infectious Disease
Bacterial membranes are complex and dynamic, arising from an array of evolutionary pressures. One enzyme that alters membrane compositions through covalent lipid modification is MprF. We recently identified that Streptococcus agalactiae MprF synthesizes lysyl-phosphatidylglycerol (Lys-PG) from anionic PG, and a novel cationic lipid, lysyl-glucosyl-diacylglycerol (Lys-Glc-DAG), from neutral glycolipid Glc-DAG. This unexpected result prompted us to investigate whether Lys-Glc-DAG occurs in other MprF-containing bacteria, and whether other novel MprF products exist. Here, we studied protein sequence features determining MprF substrate specificity. First, pairwise analyses identified several streptococcal MprFs synthesizing Lys-Glc-DAG. Second, a restricted Boltzmann machine-guided approach led us to discover an entirely new substrate for MprF in Enterococcus, diglucosyl-diacylglycerol (Glc2-DAG), and an expanded set of organisms that modify glycolipid substrates using MprF. Overall, we combined the wealth of available sequence data with machine learning to model evolutionary constraints on MprF sequences across the bacterial domain, thereby identifying a novel cationic lipid.