Published May 31, 2022 | Version v1
Journal article Open

Unifying the known and unknown microbial coding sequence space

  • 1. Max Planck Institute for Marine Microbiology
  • 2. University of Chicago
  • 3. Institut de Ciències del Mar
  • 4. University of Arizona
  • 5. Alfred Wegener Institute
  • 6. Spanish Council for Research
  • 7. Université Paris-Saclay
  • 8. King Abdullah University of Science and Technology
  • 9. Wellcome Genome Campus
  • 10. University of Copenhagen
  • 11. Seoul National University
  • 12. Jacobs University Bremen

Description

Genes of unknown function are among the biggest challenges in molecular biology, especially in microbial systems, where 40–60% of the predicted genes are unknown. Despite previous attempts, systematic approaches to include the unknown fraction into analytical workflows are still lacking. Here, we present a conceptual framework, its translation into the computational workflow AGNOSTOS, and a demonstration of how we can bridge the known-unknown gap in genomes and metagenomes. By analyzing 415,971,742 genes predicted from 1,749 metagenomes and 28,941 bacterial and archaeal genomes, we quantify the extent of the unknown fraction, its diversity, and its relevance across multiple organisms and environments. The unknown sequence space is exceptionally diverse, phylogenetically more conserved than the known fraction and predominantly taxonomically restricted at the species level. From the 71 M genes identified to be of unknown function, we compiled a collection of 283,874 lineage-specific genes of unknown function for Cand. Patescibacteria (also known as Candidate Phyla Radiation, CPR), which provides a significant resource to expand our understanding of their unusual biology. Finally, by identifying a target gene of unknown function for antibiotic resistance, we demonstrate how we can enable the generation of hypotheses that can be used to augment experimental data.

Data availability

We used public data as described in the Methods section and Appendix 1-table 5. The code used for the analyses in the manuscript is available at https://github.com/functional-dark-side/functional-dark-side.github.io/tree/master/scripts (copy archived at swh:1:rev:86968509e38902580b04a25786c5a58ba2777b21). A list of the program versions can be found at https://github.com/functional-dark-side/functional-dark-side.github.io/blob/master/programs_and_versions.txt. The code to create the figures is available at https://github.com/functional-dark-side/vanni_et_al-figures (copy archived at swh:1:rev:4c8f60e761bcac0dd02f17d2fdbb65dcaf75707a), and the data for the figures can be downloaded from https://doi.org/10.6084/m9.figshare.12738476.v2. A reproducible version of the workflow is available at https://github.com/functional-dark-side/agnostos-wf (copy archived at swh:1:rev:5f9e23e8ac524a533f81c57e500a60b56191b1f5). The data are publicly available at https://doi.org/10.6084/m9.figshare.12459056.

The following data sets were generated:

Vanni C Fernandez-Guerra A (2020) figshare agnostosDB_dbf02445-20200519. https://doi.org/10.6084/m9.figshare.12459056

The following previously published data sets were used:

O'Gara F Jackson S Orlic S Steinke M Busch J Duarte B Caçador I Bobrova O Marteinsson V Reynisson E Loureiro C Luna G Quero GM Löscher CR Kremp A DeLorenzo ME Øvreås L Tolman J LaRoche J Penna A Frischer M Davis T Katherine B Meyer C Ramos S Magalhães C Jude-Lemeilleur F Aguirre-Macedo ML Wang S Poulton N Jones S Collin R Fuhrman JA Conan P Alonso C Stambler N Goodwin K Yakimov MM Baltar F Bodrossy L Kamp JV Frampton DMF Ostrowski M Ruth PV Malthouse P Claus S Deneudt K Mortelmans J Pitois S Wallom D Salter I Costa R Schroeder DC Kandil MM Amaral V Biancalana F Santana R Pedrotti ML Yoshida T Ogata H Ingleton T Munnik K Rodriguez-Ezpeleta N Berteaux-Lecellier V Wecker P Cancio I Vaulot D Bienhold C Ghazal H Chaouni B Essayeh S Ettamimi S Zaid EH Boukhatem N Bouali A Chahboune R Barrijal S Timinouni M Otmani F Bennani M Mea M Todorova N Karamfilov V Hoopen P Cochrane G L'Haridon S Bizsel KC Vezzi A Lauro FM Martin P Jensen RM Hinks J Gebbels S Rosselli R Pascale FD Schiavon R Santos A Villar E Pesant S Cataletto B Malfatti F Edirisinghe R (2015) OSD ID ERS667653. Ocean Sampling Day. https://github.com/MicroB3-IS/osd-analysis/wiki/Guide-to-OSD-2014-data

Sunagawa A (2015) EBI European Nucleotide Archive ID PRJEB402. TARA Oceans. https://www.ebi.ac.uk/ena/browser/view/PRJEB402

Rusch DB Halpern AL Sutton G Heidelberg KB Williamson S Yooseph S Wu D Eisen JA Hoffman JM Remington K Beeson K Tran B Smith H Baden-Tillson H Stewart C Thorpe J Freeman J Andrews-Pfannkoch C Venter JE Li K Kravitz S Heidelberg JF Utterback T Rogers Y Falcón LI Souza V Bonilla-Rosso G Eguiarte LE Karl DM Sathyendranath S Platt T Bermingham E Gallardo V Tamayo-Castillo G Ferrari MR Strausberg RL Nealson K Friedman R Frazier M Venter JC (2007) NCBI BioProject ID PRJNA13694. Global Ocean Sampling. https://www.ncbi.nlm.nih.gov/bioproject?cmd=PRJNA13694

Mendler K Chen H Parks DH Lobb B Hug LA Doxey AC (2019) Annotree-Genome Taxonomy Database ID GTDB_r86. Annotree-GTDB_r86. https://data.ace.uq.edu.au/public/misc_downloads/annotree/r86/

Lloyd-Price J Mahurkar A Rahnavard G Crabtree J Orvis J Hall AB Brady A Creasy HH McCracken C Giglio MG McDonald D Franzosa EA Knight R White O Huttenhower C (2017) Human Microbiome Project ID HMP. HMP (phase I and II). http://hmpdacc.org/

 

Files

elife-67667-v2.pdf

Files (8.3 MB)

Name Size
Article
md5:a5ea65647faead0fb60ca37231d834bb
8.0 MB
md5:04587d5d54d60bbeb518c20459ccd326
365.4 kB
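
The MD5 checksums listed above can be used to confirm that a downloaded copy of a file is intact. The following is a minimal sketch, not part of the record itself: the helper names `md5_of_file` and `matches_listed_md5` are hypothetical, the local path is an assumption, and pairing the 8.0 MB checksum with elife-67667-v2.pdf is a guess based on the file sizes rather than something the record states.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


if __name__ == "__main__":
    # Assumed local download location; adjust to wherever the file was saved.
    pdf = Path("elife-67667-v2.pdf")
    # Assumption: this checksum belongs to the 8.0 MB article PDF.
    listed = "md5:a5ea65647faead0fb60ca37231d834bb"
    if pdf.exists():
        print("checksum matches" if matches_listed_md5(pdf, listed) else "checksum mismatch")
    else:
        print(f"{pdf} not found; download it first")
```

MD5 is adequate here for detecting accidental corruption during transfer; it is not a guarantee against deliberate tampering.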

Additional details

Identifiers

DOI
10.7554/eLife.67667
Other
oai:uchicago.tind.io:9842

Funding

Max Planck Society
Horizon 2020
INMARE
Biotechnology and Biological Sciences Research Council
European Molecular Biology Laboratory
Spanish Agency of Science MICIU/AEI/FEDER
INTERACTOMA RTI2018-101205-B-I00
Spanish Ministry of Economy and Competitiveness
MAGGY

UChicago Information

Division(s)
Biological Sciences Division
Department(s)
Medicine