Research Article

Unifying the known and unknown microbial coding sequence space

Max Planck Institute for Marine Microbiology, Germany
University of Chicago, United States
Institut de Ciències del Mar-CMIMA (CSIC), Spain
University of Arizona, United States
Alfred Wegener Institute, Germany
Spanish Council for Research, Spain
Genoscope, Institut François Jacob, CEA, CNRS, France
King Abdullah University of Science and Technology, Saudi Arabia
European Molecular Biology Laboratory, United Kingdom
University of Copenhagen, Denmark
Seoul National University, Republic of Korea
University of Bremen, Germany

Mar 31, 2022

Open access

Abstract

Genes of unknown function are among the biggest challenges in molecular biology, especially in microbial systems, where 40%-60% of the predicted genes are unknown. Despite previous attempts, systematic approaches to include the unknown fraction in analytical workflows are still lacking. Here, we present a conceptual framework, its translation into the computational workflow AGNOSTOS, and a demonstration of how we can bridge the known-unknown gap in genomes and metagenomes. By analyzing 415,971,742 genes predicted from 1,749 metagenomes and 28,941 bacterial and archaeal genomes, we quantify the extent of the unknown fraction, its diversity, and its relevance across multiple organisms and environments. The unknown sequence space is exceptionally diverse, phylogenetically more conserved than the known fraction, and predominantly taxonomically restricted at the species level. From the 71M genes identified to be of unknown function, we compiled a collection of 283,874 lineage-specific genes of unknown function for Cand. Patescibacteria (also known as Candidate Phyla Radiation, CPR), which provides a significant resource to expand our understanding of their unusual biology. Finally, by identifying a target gene of unknown function for antibiotic resistance, we demonstrate how we can enable the generation of hypotheses that can be used to augment experimental data.

Data availability

We used public data as described in the Methods section and Appendix 1-table 5. The code used for the analyses in the manuscript is available at https://github.com/functional-dark-side/functional-dark-side.github.io/tree/master/scripts. A list of the program versions can be found at https://github.com/functional-dark-side/functional-dark-side.github.io/blob/master/programs_and_versions.txt. The code to create the figures is available at https://github.com/functional-dark-side/vanni_et_al-figures, and the data for the figures can be downloaded from https://doi.org/10.6084/m9.figshare.12738476.v2. A reproducible version of the workflow is available at https://github.com/functional-dark-side/agnostos-wf. The data are publicly available at https://doi.org/10.6084/m9.figshare.12459056.

The following data sets were generated

(2020) agnostosDB_dbf02445-20200519
agnostosDB.

https://doi.org/10.6084/m9.figshare.12459056

The following previously published data sets were used

Article and author information

Author details

Chiara Vanni

Microbial Genomics and Bioinformatics Research G, Max Planck Institute for Marine Microbiology, Bremen, Germany

Competing interests
The authors declare that no competing interests exist.
Matthew S Schechter

Department of Medicine, University of Chicago, Chicago, United States

Competing interests
The authors declare that no competing interests exist.

ORCID iD: 0000-0002-8435-3203
Silvia G Acinas

Department of Marine Biology and Oceanography, Institut de Ciències del Mar-CMIMA (CSIC), Barcelona, Spain

Competing interests
The authors declare that no competing interests exist.
Albert Barberán

Department of Environmental Science, University of Arizona, Tucson, United States

Competing interests
The authors declare that no competing interests exist.
Pier Luigi Buttigieg

Helmholtz Centre for Polar and Marine Research, Alfred Wegener Institute, Bremerhaven, Germany

Competing interests
The authors declare that no competing interests exist.
Emilio O Casamayor

Center for Advanced Studies of Blanes CEAB-CSIC, Spanish Council for Research, Blanes, Spain

Competing interests
The authors declare that no competing interests exist.

ORCID iD: 0000-0001-7074-3318
Tom O Delmont

Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Paris, France

Competing interests
The authors declare that no competing interests exist.

ORCID iD: 0000-0001-7053-7848
Carlos M Duarte

Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia

Competing interests
The authors declare that no competing interests exist.
A Murat Eren

Department of Medicine, University of Chicago, Chicago, United States

Competing interests
The authors declare that no competing interests exist.

ORCID iD: 0000-0001-9013-4827
Robert D Finn

European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Hinxton, United Kingdom

Competing interests
The authors declare that no competing interests exist.
Renzo Kottmann

Microbial Genomics and Bioinformatics Research G, Max Planck Institute for Marine Microbiology, Bremen, Germany

Competing interests
The authors declare that no competing interests exist.
Alex Mitchell

European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Hinxton, United Kingdom

Competing interests
The authors declare that no competing interests exist.
Pablo Sánchez

Department of Marine Biology and Oceanography, Institut de Ciències del Mar-CMIMA (CSIC), Barcelona, Spain

Competing interests
The authors declare that no competing interests exist.

ORCID iD: 0000-0003-2787-822X
Kimmo Siren

Section for Evolutionary Genomics, The GLOBE Institute, University of Copenhagen, Copenhagen, Denmark

Competing interests
The authors declare that no competing interests exist.
Martin Steinegger

School of Biological Sciences, Seoul National University, Seoul, Republic of Korea

Competing interests
The authors declare that no competing interests exist.
Frank Oliver Gloeckner

MARUM, Helmholtz Center for Polar and Marine Research, University of Bremen, Bremen, Germany

Competing interests
The authors declare that no competing interests exist.
Antonio Fernàndez-Guerra

Lundbeck Foundation GeoGenetics Centre, GLOBE Institute, University of Copenhagen, Copenhagen, Denmark

For correspondence
antonio.fernandez-guerra@sund.ku.dk

Competing interests
The authors declare that no competing interests exist.

ORCID iD: 0000-0002-8679-490X

Funding

Max Planck Society

Chiara Vanni

European Union's Horizon 2020 (INMARE)

Antonio Fernàndez-Guerra

Biotechnology and Biological Sciences Research Council

Alex Mitchell

European Molecular Biology Laboratory

Robert D Finn

Spanish Agency of Science MICIU/AEI (INTERACTOMA RTI2018-101205-B-I00)

Emilio O Casamayor

Spanish Ministry of Economy and Competitiveness (MAGGY (CTM2017-87736-R))

Silvia G Acinas
Pablo Sánchez

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License permitting unrestricted use and redistribution provided that the original author and source are credited.