Abstract
The ability to predict gene expression is an essential test of our understanding of gene regulation, and improving that understanding is crucial to future advances in human genetics, precision medicine, and evolutionary biology. The challenge is that the regulation of gene expression depends on a complex interplay of factors, including the gene’s DNA sequence, the gene’s three-dimensional structure, chemical modifications to the DNA (i.e., epigenetics), environmental factors, and cell-type-specific transcription factor activity. In this work, I present advances in a) our ability to efficiently model DNA structure using machine learning and physics-based modeling and b) our ability to predict gene expression from DNA sequence using machine learning. In Chapter 2, I describe an approach that combines a particle-based chromatin polymer model, molecular simulation, and machine learning to efficiently and accurately estimate chromatin structure from indirect measures of genome structure. Specifically, the interaction parameters of a polymer model are extracted from experimental Hi-C data using a graph neural network (GNN). This approach achieves accurate, high-throughput estimation of chromatin structure from Hi-C data, which will be increasingly necessary as experimental methodologies (such as single-cell Hi-C) improve, and it can be integrated into other computational and machine learning pipelines. In Chapter 3, I describe SCEPTER, a deep learning model for predicting cell-type-specific gene expression directly from DNA sequence. SCEPTER substantially outperforms existing approaches to predicting cell-type-specific gene expression. I also show that SCEPTER can predict variant effects from saturation mutagenesis experiments and identify enhancer sequences up to 100 kb away. Ultimately, I present advances in our understanding of two of the factors that explain gene expression: DNA sequence and DNA structure.
Possible future work includes a) integrating these advances into a single model that accounts for both DNA structure and sequence (e.g., by leveraging joint scHi-C/scRNA-seq) and b) incorporating additional data modalities such as epigenetic data (e.g., ChIP-seq).