000010183 001__ 10183
000010183 005__ 20241212101655.0
000010183 02470 $$ahttps://doi.org/10.1371/journal.pcbi.1011705$$2doi
000010183 037__ $$aTEXTUAL
000010183 037__ $$bArticle
000010183 041__ $$aeng
000010183 245__ $$aStatistical prediction of microbial metabolic traits from genomes
000010183 269__ $$a2023-12-19
000010183 336__ $$aArticle
000010183 520__ $$aThe metabolic activity of microbial communities is central to their role in biogeochemical cycles, human health, and biotechnology. Despite the abundance of sequencing data characterizing these consortia, it remains a serious challenge to predict microbial metabolic traits from sequencing data alone. Here we culture 96 bacterial isolates individually and assay their ability to grow on 10 distinct compounds as a sole carbon source. Using these data as well as two existing datasets, we show that statistical approaches can accurately predict bacterial carbon utilization traits from genomes. First, we show that classifiers trained on gene content can accurately predict bacterial carbon utilization phenotypes by encoding phylogenetic information. These models substantially outperform predictions made by constraint-based metabolic models automatically constructed from genomes. This result solidifies our current knowledge about the strong connection between phylogeny and metabolic traits. However, phylogeny-based predictions fail to predict traits for taxa that are phylogenetically distant from any strains in the training set. To overcome this we train improved models on gene presence/absence to predict carbon utilization traits from gene content. We show that models that predict carbon utilization traits from gene presence/absence can generalize to taxa that are phylogenetically distant from the training set either by exploiting biochemical information for feature selection or by having sufficiently large datasets.
In the latter case, we provide evidence that a statistical approach can identify putatively mechanistic genes involved in metabolic traits. Our study demonstrates the potential power for predicting microbial phenotypes from genotypes using statistical approaches.
000010183 536__ $$oNational Science Foundation$$cMCB 2117477$$aBiology Directorate
000010183 536__ $$oNational Science Foundation$$cMCB 1921439$$aBiology Directorate
000010183 536__ $$oNational Institutes of Health$$c1R01GM151538
000010183 536__ $$oNational Science Foundation$$c2317138$$aCenter for Living Systems
000010183 540__ $$a<p>© 2023 Li et al.</p> <p>This is an open access article distributed under the terms of the <a href="http://creativecommons.org/licenses/by/4.0/" target="_blank">Creative Commons Attribution License</a>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</p>
000010183 542__ $$fCC BY
000010183 594__ $$a<p>Raw sequencing data and genome assemblies are available at the original source of each study (NCBI BioProject PRJNA660495 and PRJNA513156 for genomes from Gowda et al. (2022), PRJNA540276 for genomes from Muscarella et al. (2019), PRJNA940744 for genomes from Prabhakara et al. (2023) (sequenced in this study), and Gralka et al. (2023)). The analysis data are publicly available on Open Science Framework (<a href="https://doi.org/10.17605/OSF.IO/JWKR7">https://doi.org/10.17605/OSF.IO/JWKR7</a>).
The bioinformatic pipeline and all data analysis code are available at&nbsp;<a href="https://github.com/zeqianli/CarbonUtilization">https://github.com/zeqianli/CarbonUtilization</a>.</p>
000010183 690__ $$aBiological Sciences Division
000010183 690__ $$aPhysical Sciences Division
000010183 691__ $$aBiophysical Sciences
000010183 691__ $$aEcology and Evolution
000010183 691__ $$aPhysics
000010183 692__ $$aCenter for the Physics of Evolving Systems
000010183 7001_ $$1https://orcid.org/0000-0002-0884-8028$$2ORCID$$aLi, Zeqian$$uUniversity of Chicago
000010183 7001_ $$aSelim, Ahmed$$uUniversity of Chicago
000010183 7001_ $$1https://orcid.org/0000-0002-4130-6845$$2ORCID$$aKuehn, Seppe$$uUniversity of Chicago
000010183 773__ $$tPLOS Computational Biology
000010183 8564_ $$yArticle$$921aaef58-09f9-4cb7-806a-8b03619373b7$$s3159112$$uhttps://knowledge.uchicago.edu/record/10183/files/journal.pcbi.1011705.pdf$$ePublic
000010183 8564_ $$ySupporting information$$999f6e9cf-98b1-4c61-8b2c-ab682b5453ca$$s19837286$$uhttps://knowledge.uchicago.edu/record/10183/files/10_1371_journal_pcbi_1011705.zip$$ePublic
000010183 908__ $$aI agree
000010183 909CO $$ooai:uchicago.tind.io:10183$$pGLOBAL_SET
000010183 983__ $$aArticle