Learning Meaningful Representations of Data with Empirical Bayes Methods

Kang, Joonsuk

doi:10.6082/uchicago.12302

2024

Abstract

Matrix factorization methods are widely used to uncover hidden structure in data represented as matrices. The choice of method depends on the properties desired of the representation, such as sparsity, orthogonality, or nonnegativity. The Bayesian approach can effectively model these desired properties. Specifically, the empirical Bayes approach avoids the need to manually specify hyperparameters for each column of the factor and/or loading matrices in the model. The first chapter develops an alternative algorithm for fitting the Empirical Bayes Matrix Factorization model. The existing 'flash' algorithm updates a single factor-loading vector pair at a time while holding the others fixed. Instead, our alternating least squares-type algorithm updates the entire factor matrix (or loading matrix) at once while holding the other matrix fixed. This update allows for efficient parallel implementation, as it can be interpreted as solving multiple independent regression problems. The second chapter introduces a flexible class of empirical Bayes matrix factorization methods in which a data matrix is approximated by the product of an orthogonal factor matrix and a loading matrix with column-specific priors. We demonstrate that using sparsity-inducing priors on the loading matrix leads to a sparse PCA method. Importantly, our method avoids the "multiple tuning problem" commonly encountered in sparse PCA. The final chapter presents a matrix factorization method motivated by population genetics. The method factorizes a genotype matrix into a drift factor matrix and a drift membership matrix by combining a STRUCTURE-type method with a drift estimation method. Unlike previous approaches that represent individuals’ genotypes using populations, our method emphasizes shared genetic variation across individuals by representing individuals’ genotypes using genetic drifts, which are shared across populations. To estimate the drift factors and memberships, we propose a symmetric nonnegative matrix factorization method that penalizes deviations from a tree-based initial estimate.
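
The alternating update described for the first chapter, in which one whole matrix is refreshed while the other is held fixed, can be illustrated with a minimal sketch. This is not the thesis's empirical Bayes algorithm: it swaps in a plain ridge penalty for the empirical Bayes priors, and the names `als_update`, `fit_als`, and `ridge` are hypothetical. It is meant only to show why each half-step decomposes into independent regression problems, one per row.

```python
import numpy as np

def als_update(Y, F, ridge=1e-2):
    """Given Y (n x p) and fixed F (p x K), return L (n x K) with Y ~ L @ F.T.

    Row i of L is the ridge regression of Y[i] on the columns of F, so the
    n rows are independent problems that could be solved in parallel.
    (Assumption: a ridge penalty stands in for the empirical Bayes prior.)
    """
    K = F.shape[1]
    gram = F.T @ F + ridge * np.eye(K)        # K x K, shared by every row
    return np.linalg.solve(gram, F.T @ Y.T).T

def fit_als(Y, K, n_iter=50, ridge=1e-2, seed=0):
    """Alternate full-matrix updates of L and F (illustrative only)."""
    rng = np.random.default_rng(seed)
    F = rng.standard_normal((Y.shape[1], K))
    for _ in range(n_iter):
        L = als_update(Y, F, ridge)           # fix F, update all of L
        F = als_update(Y.T, L, ridge)         # fix L, update all of F
    return L, F

# Small synthetic example: an exactly rank-3 matrix.
rng = np.random.default_rng(1)
Y = rng.standard_normal((30, 3)) @ rng.standard_normal((3, 20))
L, F = fit_als(Y, K=3)
rel_err = np.linalg.norm(Y - L @ F.T) / np.linalg.norm(Y)
```

Because the Gram matrix is shared across rows, the whole update is a single linear solve with many right-hand sides. An empirical Bayes version would replace the ridge solve with prior-specific posterior computations, which this sketch does not attempt.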