Unifying the known and unknown microbial coding sequence space

  1. Chiara Vanni
  2. Matthew S Schechter
  3. Silvia G Acinas
  4. Albert Barberán
  5. Pier Luigi Buttigieg
  6. Emilio O Casamayor
  7. Tom O Delmont
  8. Carlos M Duarte
  9. A Murat Eren
  10. Robert D Finn
  11. Renzo Kottmann
  12. Alex Mitchell
  13. Pablo Sánchez
  14. Kimmo Siren
  15. Martin Steinegger
  16. Frank Oliver Gloeckner
  17. Antonio Fernàndez-Guerra (corresponding author)

  1. Microbial Genomics and Bioinformatics Research Group, Max Planck Institute for Marine Microbiology, Germany
  2. Jacobs University Bremen, Germany
  3. Department of Medicine, University of Chicago, United States
  4. Department of Marine Biology and Oceanography, Institut de Ciències del Mar (CSIC), Spain
  5. Department of Environmental Science, University of Arizona, United States
  6. Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research, Alfred Wegener Institute, Germany
  7. Center for Advanced Studies of Blanes CEAB-CSIC, Spanish Council for Research, Spain
  8. Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, France
  9. Red Sea Research Centre and Computational Bioscience Research Center, King Abdullah University of Science and Technology, Saudi Arabia
  10. Josephine Bay Paul Center, Marine Biological Laboratory, United States
  11. European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, United Kingdom
  12. Section for Evolutionary Genomics, The GLOBE Institute, University of Copenhagen, Denmark
  13. School of Biological Sciences, Seoul National University, Republic of Korea
  14. Institute of Molecular Biology and Genetics, Seoul National University, Republic of Korea
  15. University of Bremen and Life Sciences and Chemistry, Germany
  16. Computing Center, Helmholtz Center for Polar and Marine Research, Germany
  17. Lundbeck Foundation GeoGenetics Centre, GLOBE Institute, University of Copenhagen, Denmark
25 figures, 33 tables and 3 additional files

Figures

Conceptual framework to unify the known and unknown sequence space and integration of the framework in the current analytical workflows.

(A) Link between the conceptual framework and the computational workflow to partition the sequence space into the four conceptual categories. AGNOSTOS infers, validates and refines the gene clusters (GCs) and combines them into gene cluster communities (GCCs). Then, it classifies them into one of the four conceptual categories based on their level of ‘darkness’. Finally, we add context to each GC based on several sources of information, providing a robust framework for generating hypotheses that can be used to augment experimental data. (B) The computational workflow provides two mechanisms to structure the sequence space using GCs: de novo creation of the GCs (DB creation), or integration of the dataset into an existing GC database (DB update). The structured sequence space can then be plugged into traditional analytical workflows to annotate the genes within each GC of the known fraction. With AGNOSTOS, we provide the opportunity to easily integrate the unknown fraction into microbiome analyses. (C) The versatility of the GCs enables analyses at different scales depending on the scope of our experiments. We can group GCs into gene cluster communities based on their shared homologies to perform coarse-grained analyses. Alternatively, we can design fine-grained analyses using the relationships between the genes in a GC, that is, detecting network modules in the GC inner sequence similarity network. Additionally, because GCs are conserved across environments, organisms and experimental conditions, they give us access to an unprecedented amount of information to design and interpret experimental data.

Overview and validation of the workflow to aggregate GCs in communities.

(A) We inferred a gene cluster homology network using the results of an all-vs-all HMM gene cluster comparison with HHblits. The edges of the network are weighted by the HHblits-Score/Aligned-columns metric. Communities are identified by iteratively screening different MCL inflation parameters and are evaluated using five different metrics that consider the inter- and intra-community properties. (B) Comparison of the number of GCs and GCCs for each of the functional categories. (C) Validation of the GCC inference based on the environmental genes annotated as proteorhodopsins. Ribbons in the alluvial plot are genes, and each stacked bar corresponds (from left to right) to the (1) gene taxonomic classification at the domain level, (2) GC membership, (3) GCC membership and (4) MicRhoDE operational classification. (D) Validation of the GCC inference based on ribosomal proteins, using standard and high-quality GCs.
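The edge weighting in panel (A) can be illustrated with a minimal Python sketch. It only shows the score-per-aligned-column ratio named in the caption; the tuple layout of `hits`, the function name `build_edges`, and the choice to keep the larger weight when a pair is reported in both directions are assumptions made for illustration, not details taken from the workflow itself. Community detection with MCL is not shown.

```python
def build_edges(hits):
    """Turn profile-profile hits into undirected weighted edges.

    `hits` is an iterable of (query, target, score, aligned_cols) tuples.
    This layout is hypothetical; real HHblits output needs its own parser.
    """
    edges = {}
    for query, target, score, aligned_cols in hits:
        if query == target or aligned_cols <= 0:
            continue  # skip self-hits and degenerate alignments
        key = tuple(sorted((query, target)))
        weight = score / aligned_cols  # HHblits-Score / Aligned-columns
        # Assumption: if both directions are reported, keep the larger weight.
        if key not in edges or weight > edges[key]:
            edges[key] = weight
    return edges


# Toy input with made-up identifiers and scores.
hits = [
    ("GC_1", "GC_2", 120.0, 100),
    ("GC_2", "GC_1", 90.0, 100),
    ("GC_1", "GC_1", 300.0, 150),
    ("GC_2", "GC_3", 45.0, 90),
]
edges = build_edges(hits)
```

The resulting dictionary maps each unordered GC pair to one weight, which is the form a graph-clustering tool would typically expect as input.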

The extent of the known and unknown sequence space.

(A) Proportion of genes in the known and unknown sequence space. (B) Accumulation curves for the known and unknown sequence space at the GC level for the metagenomic and genomic data. Metagenomes are from the TARA, MALASPINA, OSD2014 and HMP-I/II projects. (C) Collector curves comparing the human and marine biomes. Colored lines represent the mean of 1000 permutations and the gray shading the standard deviation. Non-abundant singleton clusters were excluded from the calculation of the accumulation curves. (D) Amino acid distribution in the known and unknown sequence space. In all cases, the four categories have been simplified as known (K, KWP) and unknown (GU, EU).

Distribution of the unknown sequence space in the human and marine metagenomes.

(A) Ratio between the proportion of the number of genes and their estimated abundances per cluster category and biome. The facet columns depict three cluster categories based on the size of the clusters. (B) Relationship between the ratio of Genomic unknowns and Environmental unknowns in the HMP-I/II metagenomes. Gastrointestinal tract metagenomes are enriched in Genomic unknown sequences compared to the other body sites. (C) Relationship between the ratio of Genomic unknowns and Environmental unknowns in the TARA Oceans metagenomes. Girus- and virus-enriched metagenomes show a higher proportion of both unknown sequences (genomic and environmental) than the Archaea|Bacteria enriched fractions. (D) Environmental distribution of GCs and GCCs based on Levins’ niche breadth index. We obtained the significance values after generating 100 null gene cluster abundance matrices using the quasiswap algorithm.
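For readers unfamiliar with the niche breadth index used in panel (D), the following Python sketch computes the textbook form of the index, B = 1 / Σ p_i², where p_i is the share of a gene cluster's total abundance found in sample i. The function name and the toy abundances are illustrative assumptions; the significance test against quasiswap null matrices described in the caption is not reproduced here.

```python
def niche_breadth(abundances):
    """Niche breadth B = 1 / sum(p_i^2) for one gene cluster.

    `abundances` holds the cluster's abundance in each sample.
    B ranges from 1 (found in a single sample) to the number of
    samples (evenly spread across all of them).
    """
    total = sum(abundances)
    if total <= 0:
        raise ValueError("cluster has no abundance in any sample")
    proportions = [a / total for a in abundances]
    return 1.0 / sum(p * p for p in proportions)


# Toy values: an evenly distributed cluster and a single-sample cluster.
broad = niche_breadth([5, 5, 5, 5])
narrow = niche_breadth([10, 0, 0, 0])
```

A cluster would be called "broad" or "narrow" only after comparing its observed B with the distribution of B values obtained from the null matrices.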

Phylogenomic exploration of the unknown sequence space.

(A) Distribution of the lineage-specific GCs by taxonomic level. Lineage-specific unknown GCs are more abundant in the lower taxonomic levels (genus, species). (B) Phylogenetic conservation of the known and unknown sequence space in 27,372 bacterial genomes from GTDB_r86. We observe differences in the conservation between the known and the unknown sequence space for lineage- and non-lineage-specific GCs (paired Wilcoxon rank-sum test; all p-values < 0.0001). (C) The majority of the lineage-specific clusters are part of the unknown sequence space, and only a small proportion was found in prophages present in the GTDB_r86 genomes. (D) Known and unknown sequence space of the 27,372 GTDB_r86 bacterial genomes grouped by bacterial phyla. Phyla are partitioned based on the ratio of known to unknown GCs and vice versa. Phyla enriched in MAGs have higher proportions of GCs of unknown function. Phyla with a high proportion of non-classified clusters (NC; discarded during the validation steps) tend to contain a small number of genomes. (E) The left side of the alluvial plot shows the uncharacterized (OM-RGC v2 GC) and characterized (OM-RGC v2) fractions of the gene catalog. The functional annotation is based on the eggNOG annotations provided by Salazar et al., 2019. The right side of the alluvial plot shows the new organization of the OM-RGC v2 sequence space based on the approach described in this study. The treemap on the right links the metagenomic and genomic space, adding context to the unknown fraction of the OM-RGC v2.

Augmenting experimental data with GCs of unknown function.

(A) We used the fitness values from the experiments of Price et al., 2018 to identify genes of unknown function that are important for fitness under certain experimental conditions. The selected gene belongs to the genomic unknown GC GU_19737823 and presents a strong phenotype (fitness = –3.1; t = –9.1). (B) Occurrence of GU_19737823 in the metagenomes used in this study. Darker bars depict the number of metagenomes where the GC is found. (C) GU_19737823 is a member of the GCC GU_c_21103. The network shows the relationships between the different GCs that are members of the gene cluster community GU_c_21103. The size of the node corresponds to the node degree of each GC. Edge thickness corresponds to the bitscore/column metric. Highlighted in red is GU_19737823. (D) We identified all the genes in the GTDB_r86 genomes that belong to the GCC GU_c_21103 and explored their genomic neighborhoods. GU_c_21103 members were constrained to the class Gammaproteobacteria, and GU_19737823 is mostly exclusive to the order Pseudomonadales. The gene order in the different genomes analyzed is highly conserved, with GU_19737823 found after the rpsF::rpsR operon and before rplI. rpsF and rpsR encode the 30S ribosomal protein S6 and the 30S ribosomal protein S18, respectively. The GTDB_r86 subtree only shows RefSeq genomes. Branch colors correspond to the different GCs found in GU_c_21103. The bubble plot depicts the number of genomes with a gene that belongs to GU_c_21103.

Appendix 1—figure 1
Overview of the workflow to partition the genomic and metagenomic sequence space between known and unknown.

The workflow performs gene prediction, gene clustering, gene clustering validation and refinement, GCC inference, and partitions the sequence space in the different known and unknown categories.

Appendix 1—figure 2
The diagram shows a schematic description of the number of genes and GCs that have been kept or discarded.

(A) We analyzed a dataset of 1749 metagenomes from marine and human environments and 28,941 genomes from the GTDB_r86 summing up to 415,971,742 genes. The composition of the genomic box ‘Other’ is described in Appendix Note 5. (B) GC overlap between the environmental and genomic datasets.

Appendix 1—figure 3
Proportion of complete genes per cluster.

Distribution of observed values compared with those generated by the Broken-stick model. The cut-off was determined at 34% complete genes per cluster.
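As a rough illustration of the broken-stick null model named in this caption, the Python sketch below computes the expected share of the k-th largest piece when a unit stick is broken at random into n pieces, b_k = (1/n) Σ_{i=k..n} 1/i. How the observed per-cluster proportions were compared with these expectations to arrive at the 34% cut-off is not described here, so the function name and the usage are illustrative assumptions only.

```python
def broken_stick(n):
    """Expected relative sizes of n randomly broken stick pieces.

    Returns a list sorted from the largest expected piece to the
    smallest; the values sum to 1.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return [sum(1.0 / i for i in range(k, n + 1)) / n for k in range(1, n + 1)]


# Expected shares for a stick broken into four pieces.
expected = broken_stick(4)
```

Observed values that exceed the corresponding broken-stick expectation are usually read as departing from what random partitioning alone would produce.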

Appendix 1—figure 4
Collector curves for the known and unknown sequence space.

(A) Collector curves at the gene cluster level, for the TARA metagenomes, including the viral fraction (left) and excluding it (right) from the analysis. (B) Collector curves at gene cluster community level for the metagenomes from TARA, MALASPINA, and HMP-I/II projects (left) and the 28,941 GTDB genomes (right).

Appendix 1—figure 5
Collector curves for the known and unknown sequence space at the gene cluster level for (A) the metagenomes from TARA, MALASPINA and HMP-I/II projects, and for (B) the 28,941 GTDB genomes.

Singletons were excluded from the calculations.

Appendix 1—figure 6
Proportion of gene cluster categories per biome.

The y-axis shows the 11 main biome categories indicated by MGnify, with the total number of genes in each biome in parentheses. The gray fraction represents the pool of genes from MGnify that were not found in our dataset.

Appendix 1—figure 7
HMP outlier samples enriched in (A) crAssphages, and (B) papillomaviruses (HPV).
Appendix 1—figure 8
EggNOG annotations entropy within the GCs (A) and the GCCs (B).

The entropy was calculated using the function entropy.empirical() from the R package ‘entropy’, which estimates the Shannon entropy from the empirical frequencies of the observed values.
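The calculation can be mirrored in Python with a short sketch of the plug-in (empirical-frequency) Shannon entropy estimator. This is an analogue written for illustration, not the R function itself; the function name, the natural-log default and the toy annotation labels are assumptions.

```python
import math
from collections import Counter


def empirical_entropy(labels, base=math.e):
    """Plug-in Shannon entropy of a list of categorical labels.

    Each label's probability is estimated by its observed frequency.
    An entropy of 0 means all members carry the same annotation.
    """
    counts = Counter(labels)
    n = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        p = count / n
        entropy -= p * math.log(p, base)
    return entropy


# Toy clusters with hypothetical annotation labels.
homogeneous = empirical_entropy(["COG0001"] * 4)
mixed = empirical_entropy(["COG0001", "COG0002"], base=2)
```

A cluster whose members all share one annotation scores 0, which matches the median of 0 reported for both GCs and GCCs in the entropy tables.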

Appendix 3—figure 1
Proportion of outlier genes detected within each cluster MSA.

Distribution of observed values compared with those generated by the Broken-stick model. The cut-off was determined at 10% outlier genes per cluster.

Appendix 5—figure 1
Proportion of outlier genomic genes identified within each cluster MSA.

Distribution of observed values compared with those of the Broken-stick model.

Appendix 5—figure 2
Comparison of the clustering results obtained with the one-step and two-step approach in terms of cluster composition.
Appendix 7—figure 1
Radar plots used to determine the best MCL inflation value for the partitioning of the K into cluster components.

The plots were built using a combination of five variables: 1 = proportion of clusters with one component and 2 = proportion of clusters with more than one member, 3 = clan entropy (proportion of clusters with entropy = 0), 4 = intra HHblits-Score/Aligned-columns (normalized by the maximum value), and 5 = number of clusters (related to the non-redundant set of DAs). (A) Metagenomic dataset. (B) Genomic dataset.

Appendix 7—figure 2
Cluster pairs distribution based on the metrics used to weight the gene cluster HMM-HMM homology network.

(A) HHblits-Score/Aligned-columns (Vanni et al., 2021). (B) maximum(HHblits-probability x coverage) (Méheust et al.).

Appendix 7—figure 3
Determination of the edge-weight metrics for the GC HMM-HMM homology network.

We tested the metrics used in Méheust et al. and this paper (Vanni et al.). The correlations between metrics are shown per functional category. The metric used by Méheust et al. corresponds to the maximum(HHblits-probability x coverage). The metric applied in this manuscript is HHblits-Score/Aligned-columns. (A) Comparison between the metric of Méheust et al. and the HHblits-Probability. (B) Comparison between the metric used in this manuscript and the HHblits-Probability. (C) Comparison between the metric used in this manuscript and the metric of Méheust et al.

Appendix 7—figure 4
Agreement between the number of communities within ribosomal protein families between our approach and the one described in Méheust et al.
Appendix 9—figure 1
Coverage of external datasets.

The bar plot shows the proportion of covered genes in each of the seven datasets that were screened against the HMM profiles of the metagenomic set of clusters.

Appendix 10—figure 1
Broadly distributed EU mapping on TARA MAGs results.

(A) Histogram of TARA MAG percent completeness (checkM). The red line represents the number of EU found in the MAGs. (B) Contigs from the TARA MAG TARA_ANW_MAG_00076, ordered by descending proportion of non-hypothetical gene content. (C) EU communities in the context of a MAG contig. Contig genomic neighborhood around two potential EU communities.

Appendix 11—figure 1
Phylogenomic exploration of the unknown sequence space in Archaea.

(A) Distribution of the lineage-specific gene clusters by taxonomic level. Lineage-specific unknown gene clusters are more abundant at the lower taxonomic levels (genus, species). (B) Phylogenetic conservation of the known and unknown sequence space in 1,569 archaeal genomes from GTDB. We calculated the mean trait depth (τD) with the consenTRAIT algorithm and the lineage specificity using the F1-score approach from Mendler et al., 2019. We observe differences in the conservation between the known and the unknown sequence space for lineage- and non-lineage-specific gene clusters (paired Wilcoxon rank-sum test; all p-values < 0.0001). (C) The majority of the lineage-specific clusters are part of the unknown sequence space, and only a small proportion was found in prophages present in the GTDB genomes. (D) Known and unknown sequence space of the 1,569 GTDB archaeal genomes grouped by archaeal phyla. Phyla are partitioned based on the ratio of known to unknown gene clusters and vice versa from the set of genomes. Phyla enriched in metagenome-assembled genomes (MAGs) have a higher proportion of gene clusters of unknown function.

Appendix 12—figure 1
Cand. Patescibacteria metagenomic lineage-specific clusters.

(A) Phylogenetic tree of Cand. Patescibacteria genera, colored by classes. The heatmaps around the tree show the proportion of lineage-specific gene clusters of knowns and unknowns in the metagenomes from TARA, Malaspina and the HMP. (B) Metagenomic lineage-specific clusters in the class of Gracilibacteria.

Tables

Key resources table
| Reagent type (species) or resource | Designation | Source or reference | Identifiers | Additional information |
| Software, algorithm | Snakemake | Snakemake | RRID:SCR_003475 | Workflow manager |
| Software, algorithm | Prodigal | Prodigal | RRID:SCR_021246 | Gene prediction |
| Software, algorithm | MMseqs2 | MMseqs2 | RRID:SCR_010277 | Sequence clustering and search |
| Software, algorithm | HMMER | HMMER | RRID:SCR_005305 | Sequence-Profile search |
| Software, algorithm | HHblits | HHblits | RRID:SCR_010277 | Profile-Profile search |
| Software, algorithm | PARASAIL | PARASAIL | RRID:SCR_021805 | Sequence alignment |
| Software, algorithm | FAMSA | FAMSA | RRID:SCR_021804 | Sequence alignment |
| Software, algorithm | LEON-BIS | LEON-BIS | RRID:SCR_021803 | Sequence alignment evaluation |
| Software, algorithm | OD-SEQ | OD-SEQ | | Sequence alignment http://www.bioinf.ucd.ie/download/od-seq.tar.gz |
| Software, algorithm | SEQKIT | SEQKIT | RRID:SCR_018926 | Fasta file manipulation |
| Software, algorithm | R | R | RRID:SCR_002394 | |
| Software, algorithm | HH-SUITE | HH-SUITE | RRID:SCR_016133 | |
| Software, algorithm | RAXML | RAXML | RRID:SCR_006086 | Phylogeny |
| Software, algorithm | PPLACER | PPLACER | RRID:SCR_004737 | Phylogeny |
| Software, algorithm | PAPARA | PAPARA | | Sequence alignment https://cme.hits.org/exelixis/resource/download/software/papara_nt-2.5-static_x86_64.tar.gz |
| Software, algorithm | Anvi’o | Anvi’o | RRID:SCR_021802 | Omics analysis and visualization https://merenlab.org/software/anvio |
| Software, algorithm | BWA mapper | BWA mapper | RRID:SCR_010910 | Sequence alignment |
| Software, algorithm | BEDTOOLS | BEDTOOLS | RRID:SCR_006646 | |
| Software, algorithm | PhageBoost | PhageBoost | | https://github.com/ku-cbd/PhageBoost |
| Software, algorithm | EGGNOG-mapper | EGGNOG-mapper | RRID:SCR_021165 | |
Appendix 1—table 1
Number of metagenomic clusters and genes after the validation and refinement steps.
| | Good-quality | Bad-quality | Total |
| Clusters | 2,940,257 | 63,640 | 32,465,074 |
| Genes | 260,142,354 | 8,325,409 | 322,248,552 |
Appendix 1—table 2
MG +GTDB high-quality (HQ) subset of gene clusters (GCs).
| Category | HQ GCs | HQ genes | pHQ GCs | pHQ genes |
| K | 76,718 | 40,710,936 | 0.0145 | 0.120 |
| KWP | 16,922 | 1,733,599 | 0.00320 | 0.005132 |
| GU | 95,370 | 9,908,630 | 0.0180 | 0.0293 |
| EU | 14,207 | 477,625 | 0.00269 | 0.00141 |
| Total | 203,217 | 52,830,790 | 0.0384 | 0.1562 |
Appendix 1—table 3
Mean proportion of complete genes per cluster in the four functional categories.
| | K | KWP | GU | EU |
| Mean proportion of complete genes | 0.50 | 0.22 | 0.68 | 0.70 |
Appendix 1—table 4
KWP high-quality gene clusters (GCs) distribution in the COG groups.

(Full table in Supplementary file 1A).

| COG group | Number of GCs | Proportion of GCs |
| CELLULAR PROCESSES AND SIGNALING | 2292 | 0.135 |
| INFORMATION STORAGE AND PROCESSING | 1582 | 0.0935 |
| METABOLISM | 1679 | 0.0992 |
| POORLY CHARACTERIZED | 2899 | 0.171 |
| NC | 8470 | 0.501 |
Appendix 1—table 5
Environmental (metagenomic) dataset description.
(A) Number of samples and sites per metagenomic project.
| Dataset | Reference | Samples | Sites | Contigs |
| TARA | Sunagawa et al., 2015 | 242 | 141 | 62,404,654 |
| Malaspina | Duarte, 2015 | 116 | 30 | 9,330,293 |
| OSD | Kopf et al., 2015 | 145 | 139 | 4,127,095 |
| HMP | Lloyd-Price et al., 2017 | 1,246 | 18 | 80,560,927 |

| Dataset | Reference | Samples | Sites | Reads |
| GOS | Rusch et al., 2007 | 80 | 70 | 12,672,518 |
(B) Number of predicted genes per completeness category.
| Total | "00" | "10" | "01" | "11" |
| 322,248,552 | 118,717,690 | 106,031,163 | 102,966,482 | 75,694,123 |
Note: "00" = complete, both start and stop codon identified. "01" = right boundary incomplete. "10" = left boundary incomplete. "11" = both left and right edges incomplete.

Appendix 1—table 6
Summary of the number of EU clusters based on their presence in MAGs and their environmental distribution, obtained with Levins’ niche breadth index.
| | Total clusters | Broad | Narrow | Non-significant |
| Total EU | 204,031 | 471 | 8421 | 195,079 |
| EU in MAGs | 55,520 | 88 | 316 | 55,116 |
| EU not in MAGs | 148,511 (73%) | 383 (81%) | 8105 (96%) | 140,023 (72%) |
Appendix 1—table 7
Number of lineage-specific gene clusters of unknown function at different taxonomic levels within the Cand. Patescibacteria phylum.

| Taxonomic level | Number of clusters |
| Phylum | 2 |
| Class | 6 |
| Order | 104 |
| Family | 1456 |
| Genus | 6987 |
| Species | 45,788 |
Appendix 1—table 8
Shannon entropy values for the eggNOG annotations within the gene clusters.
| | Min. | 1st qu. | Median | Mean | 3rd qu. | Max. |
| Entropy per GC | 0.000 | 0.000 | 0.000 | 0.105 | 0.000 | 3.729 |
Appendix 1—table 9
Shannon entropy values for the eggNOG annotations within the gene clusters communities.
| | Min. | 1st qu. | Median | Mean | 3rd qu. | Max. |
| Entropy per GCC | 0.000 | 0.000 | 0.000 | 0.285 | 0.400 | 3.721 |
Appendix 2—table 1
Singletons and small GCs Pfam annotations.
| | Total | Annotated | Not annotated |
| Singletons | 19,911,324 | 934,548 | 18,976,776 |
| Small GCs | 9,549,853 | 1,028,076 | 8,521,777 |
Appendix 2—table 2
Number of singletons and small GCs per functional category.
| | K | KWP | GU | EU |
| Singletons | 852,413 | 3,505,161 | 2,763,476 | 12,790,274 |
| Small GCs | 946,112 | 2,213,654 | 2,744,262 | 3,645,825 |
Appendix 3—table 1
Number of spurious, shadow and outlier genes in the metagenomic clusters.
Gene categoryClusters ≥ 10 genesClusters < 10 genesSingletons
Spurious44,20567842,335
Shadow289,258144,571177,126
Outliers3,118,850--
Appendix 3—table 2
Metagenomic gene cluster validation results.
(A) Evaluation of cluster sequence composition.
| | Pre-Compos. validation | Good quality | Bad quality |
| Clusters | 3,003,897 | 2,958,266 | 45,631 |
| Genes | 268,467,763 | 266,268,638 | 2,199,125 |
(B) Evaluation of cluster Pfam functional annotations.
| | Pre-Funct. validation | Funct. good | Funct. bad |
| Clusters | 1,015,924 | 1,004,166 | 11,758 |
| Genes | 181,433,541 | 178,167,583 | 3,246,002 |
Appendix 3—table 3
Steps: Step I - Removal of the "bad clusters". Step II - Removal of the "shadow clusters". Step III - Removal of single spurious, shadow or outlier genes.

(A) Number of clusters in each step of the cluster refinement.
| | Step I | Step II | Step III | Refined |
| Clusters | 3,003,897 | 2,946,845 | 2,940,593 | 2,940,257 |
| Removed | –57,052 | –6,252 | –336 | |
(B) Number of genes in each step of the cluster refinement.
| | Step I | Step II | Step III | Refined |
| Genes | 268,467,763 | 263,022,636 | 262,851,348 | 260,142,354 |
| Removed | –5,445,127 | –171,288 | –2,708,994 | |
Appendix 4—table 1
Metagenomic gene clusters classification steps.
(A) Results from the search against the UniRef90 database.
| Search vs UniRef90 | Hits | No-hits |
| Initial clusters: 1,946,737 | 1,581,115 | 365,622 |
| | Characterized | Hypothetical |
| Hits | 749,439 | 831,676 |
(B) Results from the search against the NCBI nr database.
| Search vs NCBI nr | Hits | No-hits |
| Initial clusters: 365,622 | 20,277 | 345,345 |
| | Characterized | Hypothetical |
| Hits | 4,279 | 15,998 |
(C) Classification of the Pfam annotated GCs based on the consensus DAs.
| Consensus DA analysis | Annotated to DKF DAs | Annotated to DUF DAs |
| Initial clusters: 993,520 | 912,551 | 80,969 |
Appendix 4—table 2
Metagenomic GC remote homology refinement steps.
| | K | KWP | GU | EU |
| Initial GCs | 912,551 | 753,718 | 928,643 | 345,345 |
| EU refinement | - | +38,333 | +171,183 | –209,516 |
| Post-EU refinement | 912,551 | 792,051 | 1,099,826 | 135,829 |
| KWP refinement | +137,615 | –159,598 | +21,983 | - |
| Refined GCs | 1,050,166 | 632,453 | 1,121,809 | 135,829 |
Appendix 5—table 1
GTDB integration in the metagenomic dataset.
| | Metagenomic | Shared | Genomic | Total |
| GCs | 30,301,693 | 2,163,381 | 7,958,475 | 40,423,549 |
| Genes | 199,693,614 | 190,001,314 | 26,276,814 | 415,971,742 |
Appendix 5—table 2
Genomic GC validation results.
(A) Evaluation of cluster sequence composition.
| | Pre-Compos. validation | Good quality | Bad quality |
| GCs | 2,400,037 | 2,361,585 | 38,452 |
| Genes | 20,718,376 | 20,364,454 | 353,922 |
(B) Evaluation of Pfam functional annotations.
| | Pre-Funct. validation | Good quality | Bad quality |
| GCs | 556,834 | 542,410 | 14,424 |
| Genes | 10,091,203 | 9,865,550 | 225,653 |
(C) Combined cluster validation results.
| | Pre-validation | Good quality | Bad quality |
| GCs | 2,400,037 | 2,347,502 | 52,535 |
| Genes | 20,718,376 | 20,141,636 | 576,740 |
Appendix 5—table 3
Spurious, shadow, and outlier genes in the genomic GCs.
| Gene category | GCs ≥ 2 genes | Singletons |
| Spurious | 3,252 | 1,312 |
| Shadow | 223,535 | 125,262 |
| Outliers | 449,080 | - |
Appendix 5—table 4
Non-annotated genomic GC classification.
(A) Results from the search against the UniRef90 database.
| Search vs UniRef90 | Hits | No-hits |
| Initial GCs: 1,816,999 | 1,570,094 | 246,905 |
| | Characterized | Hypothetical |
| Hits | 304,004 | 1,266,090 |
(B) Results from the search against the NCBI nr database.
| Search vs NCBI nr | Hits | No-hits |
| Initial GCs: 246,905 | 28,704 | 218,201 |
| | Characterized | Hypothetical |
| Hits | 1,280 | 27,424 |
(C) Classification of the Pfam annotated GCs based on the consensus DAs.
| Consensus DA analysis | DKF DAs | DUF DAs |
| Initial GCs: 993,520 | 912,551 | 65,688 |
Appendix 5—table 5
Genomic GC remote homology refinement and final genomic GC dataset.
(A) Remote-homology refinement steps.
| | K | KWP | GU | EU |
| Initial GCs | 464,815 | 305,284 | 1,359,202 | 218,201 |
| EU refinement | - | +5,704 | +144,295 | –149,999 |
| Post-EU refinement | 464,815 | 310,988 | 1,503,497 | 68,202 |
| KWP refinement | +152,529 | –174,582 | +22,053 | - |
| Refined GCs | 617,344 | 136,406 | 1,525,550 | 68,202 |
(B) Genomic GC refined dataset.
| | K | KWP | GU | EU | Total |
| Genes | 9,997,529 | 663,107 | 9,305,621 | 175,379 | 20,141,636 |
| GCs | 617,344 | 136,406 | 1,525,550 | 68,202 | 2,347,502 |
Appendix 5—table 6
Genomic high quality (HQ) GCs.
| Category | HQ GCs | HQ genes | pHQ GCs | pHQ genes |
| K | 12,202 | 25,105,156 | 0.0198 | 0.0096 |
| KWP | 4,019 | 1,349,165 | 0.0295 | 0.0214 |
| GU | 12,699 | 8,403,393 | 0.0083 | 0.0062 |
| EU | 438 | 471,820 | 0.0064 | 0.0074 |
Appendix 5—table 7
MG + GTDB seed database. Integrated number of genes and GCs per category.

| | K | KWP | GU | EU | Total |
| Genes | 230,641,763 | 2,754,365 | 68,509,335 | 3,534,207 | 335,439,673 |
| GCs | 1,667,510 | 768,859 | 2,647,359 | 204,031 | 5,287,759 |
Appendix 5—table 8
Overview of genomic genes found homologous to metagenomic genes.
| | Total | In MG good-quality GCs | In MG small GCs | In MG singletons | In MG bad-quality GCs |
| Genes | 67,446,376 | 55,155,683 | 7,010,987 | 3,700,844 | 1,578,862 |
Appendix 5—table 9
Comparison of one-step and two-step clustering results in numbers.
| Approach | Total number of gene clusters | Of which singletons |
| One-step | 5,430,780 | 3,770,230 |
| Two-step | 5,462,006 | 3,779,961 |
Appendix 6—table 1
Number of MG +GTDB GCs annotated to the DPD per functional category.
| K | KWP | GU | EU |
| 374,555 | 8,874 | 22,135 | 0 |
Appendix 7—table 1
Number of gene clusters, cluster communities, and reduction rate shown by functional category.
(A) Metagenomic dataset (MG)
| | K | KWP | GU | EU | Total |
| Clusters | 1,050,166 | 632,453 | 1,121,809 | 135,829 | 2,940,257 |
| Communities | 24,181 | 64,938 | 146,100 | 48,095 | 283,314 |
| Reduction (%) | 97.7 | 89.73 | 86.98 | 64.59 | 90.36 |
(B) Genomic dataset (GTDB)
| | K | KWP | GU | EU | Total |
| Clusters | 617,344 | 136,406 | 1,525,550 | 68,202 | 2,347,502 |
| Communities | 52,360 | 47,203 | 339,468 | 57,899 | 496,930 |
| Reduction (%) | 91.52 | 65.39 | 77.75 | 15.11 | 79.30 |
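The reduction rates for the metagenomic dataset in part (A) are consistent with 100 × (1 − communities / clusters). That formula is inferred from the tabulated numbers rather than stated in the table, so the Python sketch below should be read as a plausibility check under that assumption; the function and variable names are illustrative.

```python
def reduction_pct(n_clusters, n_communities):
    """Percentage reduction when clusters are aggregated into communities.

    Assumed formula: 100 * (1 - communities / clusters).
    """
    return 100.0 * (1.0 - n_communities / n_clusters)


# Cluster and community counts taken from part (A), metagenomic dataset.
metagenomic = {
    "K": (1_050_166, 24_181),
    "KWP": (632_453, 64_938),
    "GU": (1_121_809, 146_100),
    "EU": (135_829, 48_095),
    "Total": (2_940_257, 283_314),
}
reductions = {
    category: round(reduction_pct(clusters, communities), 2)
    for category, (clusters, communities) in metagenomic.items()
}
```

Under this assumption, the known fraction collapses into communities far more strongly than the environmental unknowns, which is the pattern the table reports.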
Appendix 7—table 2
Measures of similarity between the community inference approach proposed in this paper, the one used in Méheust et al. and the "ground truth" represented by the ribosomal protein families.
| | Vanni et al. vs Méheust et al. | Vanni et al. vs ribosomal families | Méheust et al. vs ribosomal families |
| ARI | 0.915 | 0.944 | 0.906 |
| AMI | 0.928 | 0.916 | 0.878 |
| NVI | 0.101 | 0.0858 | 0.124 |
| NID | 0.0717 | 0.0841 | 0.122 |
| NMI | 0.928 | 0.916 | 0.878 |
Note: ARI = Adjusted Rand Index; AMI = Adjusted Mutual Information; NVI = Normalized Variation Information; NID = Normalized Information Distance; NMI = Normalized Mutual Information.

Appendix 8—table 1
Number of genomic singletons per functional category.
| | K | KWP | GU | EU |
| Genes | 473,460 | 896,127 | 2,528,370 | 1,660,481 |
Appendix 8—table 2
Minimum slope values for the collector curves.
(A) Excluding singletons. In parentheses, the number of genomes or metagenomes at the first occurrence of slope < 1.
| | Gene clusters (metaG) | Gene clusters (GTDB) | Gene cluster communities (metaG) | Gene cluster communities (GTDB) |
| Known | 209.235 | 6.556 | 0.1344 (440) | 0.07 (15,120) |
| Unknown | 374.514 | 75.851 | 0.1375 (600) | 0.621 (27,690) |
(B) Including singletons (with a mode abundance in the samples of 8.36).
| | Gene clusters (metaG) | Gene clusters (GTDB) |
| Known | 1329.489 | 66.063 |
| Unknown | 4843.570 | 158.891 |
Appendix 9—table 1
Re-classification of the unknowns identified in Wyman et al. and Price et al.
| Study | Original unknown set | Covered fraction | Found as known | Found as unknown |
| Wyman et al. | 61,970 | 38,174 | 12,366 | 25,808 |
| Price et al. | 49,736 | 33,016 | 21,967 | 11,049 |
Appendix 12—table 1
Number of lineage-specific clusters within the Cand. Patescibacteria phylum, at different taxonomic levels, subdivided by cluster categories.

Taxonomic levelKKWPGUEU
Phylum1020
Class11060
Order4111040
Family45291,44313
Genus625986,649338
Species4,11681842,7103,078

Additional files

Supplementary file 1

Supplementary tables.

(a) KWP high-quality gene clusters (GCs) distribution in the COG groups. (b) Proportion of genes in each cluster category, and Pfam amino acids coverage per cluster category. (c) List of HMP outlier samples. (d) Number of phylogenetic conserved and lineage-specific gene clusters (GCs) in the GTDB bacterial phylogeny. (e) Clusters in the GU community GU_c_21103. (f) List of filtered samples used for the metagenomic analyses. (g) List of terms commonly used to define proteins of unknown function in public databases. (h) Sequence similarity values between viral genes and Needham et al. viral PRs. (i) Number of phylogenetic conserved and lineage-specific GCs in the GTDB archaeal phylogeny.

https://cdn.elifesciences.org/articles/67667/elife-67667-supp1-v2.xlsx
Supplementary file 2

Supplementary tables describing general cluster properties.

(a) Overall properties for the GCs of the integrated dataset (MG + GTDB). (b) Statistics for the integrated dataset (MG+GTDB). (c) Taxonomic variation within each gene cluster category. (d) Statistics for the metagenomic dataset. (e) Statistics for the genomic dataset.

https://cdn.elifesciences.org/articles/67667/elife-67667-supp2-v2.xlsx
Transparent reporting form
https://cdn.elifesciences.org/articles/67667/elife-67667-transrepform1-v2.pdf

Cite this article

Chiara Vanni, Matthew S Schechter, Silvia G Acinas, Albert Barberán, Pier Luigi Buttigieg, Emilio O Casamayor, Tom O Delmont, Carlos M Duarte, A Murat Eren, Robert D Finn, Renzo Kottmann, Alex Mitchell, Pablo Sánchez, Kimmo Siren, Martin Steinegger, Frank Oliver Gloeckner, Antonio Fernàndez-Guerra (2022) Unifying the known and unknown microbial coding sequence space. eLife 11:e67667. https://doi.org/10.7554/eLife.67667