Abstract
The genetic code carries instructions for the development and functioning of every biological organism. Variation in this code may cause mis-regulation of gene expression, affect cellular states, and ultimately lead to observable changes in organism-level traits. Genome-wide association studies (GWAS) have discovered thousands of statistically significant associations between single-nucleotide polymorphisms (SNPs) and diseases or traits in humans; however, functional interpretation of these associations remains challenging. Gaining mechanistic insight into the relationship between genetic variants and their phenotypes requires, as a first and fundamental step, a comprehensive understanding of the gene regulatory architecture. This dissertation addresses some of the challenges in unravelling the regulatory functions of non-coding regions and the effects of genetic variation at the transcriptional level. I develop novel computational frameworks and statistical methods that complement experimental approaches; together, they aid the discovery of regulatory elements and functional disease variants and improve the understanding of the genetic basis of disease. In Chapter 1, using ATAC-seq and RNA-seq data from human neurons, I map cis-regulatory elements and investigate their interactions with transcription factors and target genes, deriving preliminary gene regulatory networks for autism risk genes. In Chapter 2, we outline a novel framework for integrating GWAS results with allele-specific open chromatin (ASoC) information, that is, allelic imbalance in chromatin accessibility captured by ATAC-seq. Leveraging ASoC in neurons, we prioritize putative causal non-coding SNPs from schizophrenia GWAS. Analysis of data from a single-cell CRISPRi screen with RNA-seq readout further confirms the regulatory functions at six SNP loci and their corresponding target genes. However, because high-throughput single-cell CRISPR screening technologies are new, statistical methods for effective analysis and interpretation of such data are lacking.
In Chapter 3, I develop a novel Bayesian factor analysis method that detects, from these perturbed expression data, the genes and gene modules affected by the CRISPR perturbations. I apply this method to simulated and publicly available datasets. In addition to identifying biologically relevant gene modules, the method has greater power to detect differentially expressed genes than alternative methods, and its results shed light on the regulatory basis of T cell activation and neuronal differentiation.