Deciphering High-Dimensional Biological Data—From Delineating Transcriptional Dynamics to Unraveling Protein-Protein Interaction Landscapes

Gao, Cheng

doi:10.6082/uchicago.16571

2025

Abstract

Understanding complex biological systems requires innovative computational approaches that can extract meaningful patterns from high-dimensional data. This dissertation presents two complementary frameworks that address fundamental challenges in deciphering biological dynamics: inferring cellular differentiation trajectories from single-cell RNA sequencing (scRNA-seq) data and mapping protein-protein interaction (PPI) landscapes through high-throughput screening. In Chapter 2, I introduce TopicVelo, a novel computational method for RNA velocity inference that addresses critical limitations in existing approaches for trajectory inference from scRNA-seq data. TopicVelo combines probabilistic topic modeling with a physically meaningful transcriptional burst model to dissect and integrate distinct gene programs operating simultaneously within heterogeneous cell populations. By leveraging topic modeling to identify process-specific genes and cells, TopicVelo captures detailed dynamics that are obscured in global analyses. The method employs a stochastic transcriptional burst model to infer kinetic parameters more accurately than deterministic approaches, explicitly accounting for transcriptional stochasticity, then integrates process-specific transition matrices using cell-specific topic weights to construct a global model of cell state transitions. Across diverse developmental systems—including human hematopoiesis, mouse erythropoiesis, dentate gyrus development, and pancreatic endocrinogenesis—TopicVelo recovers biologically accurate trajectories without requiring metabolic labeling or multiple time points. The method identifies terminal cell states, quantifies transition dynamics through mean first-passage time analysis, and reveals complex trajectories including bidirectional transitions and convergent differentiation paths that are missed by existing approaches. 
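
The integration and first-passage steps described above can be illustrated with a small numerical sketch. This is not the dissertation's implementation: it assumes that each process-specific transition matrix is row-stochastic over the same set of cells, that the global matrix is formed by weighting each cell's outgoing row by that cell's topic weights, and that mean first-passage time is computed in discrete steps on the resulting Markov chain. The function names `combine_transition_matrices` and `mean_first_passage_time` and the toy three-state matrices are hypothetical.

```python
import numpy as np


def combine_transition_matrices(per_topic, weights):
    """Blend per-topic transition matrices into one global matrix.

    per_topic: array of shape (K, N, N), each slice row-stochastic.
    weights:   array of shape (N, K), each row sums to 1 (assumed topic weights).
    Row i of the result is sum_k weights[i, k] * per_topic[k, i, :].
    """
    per_topic = np.asarray(per_topic, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return np.einsum("ik,kij->ij", weights, per_topic)


def mean_first_passage_time(P, target):
    """Expected number of steps to first reach `target` from each state.

    Solves (I - Q) t = 1, where Q is P restricted to the non-target states.
    Assumes the target is reachable from every other state.
    """
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    others = [s for s in range(n) if s != target]
    Q = P[np.ix_(others, others)]
    t = np.linalg.solve(np.eye(len(others)) - Q, np.ones(len(others)))
    mfpt = np.zeros(n)
    mfpt[others] = t
    return mfpt


# Toy example: three cell states and two hypothetical processes (topics).
topic_a = [[0.5, 0.5, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
topic_b = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

P_global = combine_transition_matrices([topic_a, topic_b], weights)
mfpt_to_terminal = mean_first_passage_time(P_global, target=2)
print(P_global)
print(mfpt_to_terminal)
```

In this toy chain, state 2 is absorbing and acts as the terminal state; because each row of the combined matrix is a convex combination of row-stochastic rows, the global matrix remains row-stochastic.
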
In Chapter 3, I present a comprehensive data science framework for analyzing PPI fitness landscapes using Phage-Assisted Non-Continuous Selection (PANCS) combined with machine learning. Through systematic analysis of ~1,300 selections across 96 diverse target proteins and 10 distinct randomization strategies of the affibody scaffold, we demonstrate that binder density, the probability that a random variant binds a target, varies dramatically across targets (from >10⁻⁵ to <10⁻⁸) and is primarily controlled by target-specific features rather than design strategy. Using unsupervised clustering of next-generation sequencing data, we reveal that PPI landscapes exhibit reproducible topologies characterized by discrete clusters of binders sharing minimal binding motifs, with motif frequency strongly correlated with binder density. To enable predictive modeling, I develop a supervised learning framework combining protein language models (ESM-2) with ranked contrastive learning that co-embeds target proteins and potential binders in a shared latent space where proximity reflects binding affinity. The model successfully discriminates binders from non-binders on held-out test data, with performance correlated with target difficulty.

Together, these frameworks demonstrate how integrating mechanistic modeling with machine learning transforms high-dimensional biological data into actionable insights. TopicVelo provides an interpretable approach to understanding cellular dynamics by explicitly modeling transcriptional processes underlying cell state transitions. The PANCS framework reveals fundamental principles governing protein-protein interactions and establishes a paradigm for experimental-computational integration in protein engineering. Both methods embrace the perspective that complex biological systems can be understood through lower-dimensional representations that capture essential mechanisms while remaining grounded in biological and physical principles.