Published July 7, 2023 | Version v1
Journal article Open

A flexible empirical Bayes approach to multivariate multiple regression, and its improved accuracy in predicting multi-tissue gene expression from genotypes

Description

Predicting phenotypes from genotypes is a fundamental task in quantitative genetics. With technological advances, it is now possible to measure multiple phenotypes in large samples. Multiple phenotypes can share their genetic component; therefore, modeling these phenotypes jointly may improve prediction accuracy by leveraging effects that are shared across phenotypes. However, effects can be shared across phenotypes in a variety of ways, so computationally efficient statistical methods are needed that can accurately and flexibly capture patterns of effect sharing. Here, we describe new Bayesian multivariate multiple regression methods that, by using flexible priors, are able to model and adapt to different patterns of effect sharing and specificity across phenotypes. Simulation results show that these new methods are fast and improve prediction accuracy compared with existing methods in a wide range of settings where effects are shared. Further, in settings where effects are not shared, our methods still perform competitively with state-of-the-art methods. In real data analyses of expression data in the Genotype-Tissue Expression (GTEx) project, our methods improve prediction performance on average for all tissues, with the greatest gains in tissues where effects are strongly shared and in the tissues with smaller sample sizes. While we use gene expression prediction to illustrate our methods, the methods are generally applicable to any multi-phenotype application, including prediction of polygenic scores and breeding values. Thus, our methods have the potential to provide improvements across fields and organisms.

Data availability

The genotype and expression data used in our analyses are available from dbGaP (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v8.p2). All code implementing the simulations and the compiled results generated from our simulations have been deposited on Zenodo (https://doi.org/10.5281/zenodo.8014360). The methods are implemented in the R package mr.mash.alpha, available for download at https://github.com/stephenslab/mr.mash.alpha.

Files

journal.pgen.1010539.pdf

Files (4.1 MB)

Name Size
Supporting information 2.8 MB
md5:cc650973428b119106bf96b27bfc7896
Article 1.3 MB
md5:1ce058f9166c0cc7b5bef3f172fba2d1

Additional details

Identifiers

DOI
10.1371/journal.pgen.1010539
Other
oai:uchicago.tind.io:6796

Funding

National Institute of General Medical Sciences
P20GM139769
National Institute of General Medical Sciences
R35GM146868
National Human Genome Research Institute
R01HG002585
National Institute on Aging
R01AG076901

UChicago Information

Division(s)
Biological Sciences Division, Physical Sciences Division
Department(s)
Human Genetics, Medicine, Statistics