Abstract
Biomolecular design plays a critical role across sectors, enabling advances in drug discovery, materials science, energy, and sustainability, while AI-driven approaches are emerging as a transformative force. Yet realizing reliable AI-driven biomolecular design remains challenging due to the need for controllable, constraint-aware generation, unified multimodal integration, robustness to noisy and sparse data, and efficient optimization under costly experimental feedback and vast design spaces, all while preserving biophysical fidelity. This dissertation unifies these challenges under the CURED framework (Controllability, Unified multimodality, Robustness, Efficiency, and Dependability on biological principles) and addresses them by integrating advanced generative models with active reinforcement learning (RL), grounded in theoretical guarantees and first-principles biology. By tightly coupling generative modeling and reinforcement learning, where generative models serve as oracles or world models to guide exploration and data acquisition, reinforcement learning refines generative policies, and both co-evolve across pretraining, post-training, and inference, this work enables controllable, data-efficient, and biologically grounded biomolecular design.
Chapter 2 introduces a bi-hierarchical multimodal protein representation framework that integrates sequence-based protein language models with structure-aware graph neural networks. Through bidirectional hierarchical fusion, it learns biologically informed representations and establishes a strong foundation for downstream generative modeling across protein-level, protein–ligand, and protein–protein interaction tasks.
Building on this representational foundation, Chapters 3–6 develop a unified family of GPT-based biomolecular generation frameworks for lead discovery and optimization. Chapter 3 presents DrugImproverGPT, which combines GPT pretraining with structured policy optimization (SPO) post-training to enable targeted molecular property optimization while preserving chemical validity and similarity. Chapter 4 introduces ControllableGPT, a controllable pretraining paradigm that couples a causally masked sequence-to-sequence objective with controllable decoding, enabling precise and interpretable molecular edits for lead optimization. Chapter 5 proposes ScaffoldGPT, a scaffold-centric framework integrating multi-stage pretraining, RL post-training, and decoding-level optimization to achieve robust scaffold-preserving molecular design under biophysical constraints. Chapter 6 further extends this line with FragmentGPT, the first GPT-based model unifying fragment growing, linking, and merging, enabled by chemically and energy-aware pretraining and Reward Ranked Alignment with Expert Exploration (RAE) for diversity and multi-objective optimization.
Chapters 7–10 establish a principled foundation for active learning and reinforcement learning with multiple experts. Chapter 7 introduces Contextual Active Model Selection (CAMS) with theoretical guarantees for cost-aware expert querying. Chapters 8 and 9 extend this paradigm to full RL, developing algorithms for active policy selection and a robust self-improvement framework that unifies imitation learning and RL under imperfect oracles. Chapter 10 presents Active Advantage-Aligned Reinforcement Learning (A3RL), bridging offline and online RL through confidence-aware sampling to improve sample efficiency under limited data.
Finally, Chapter 11 introduces Entropy-Reinforced Planning (ERP) to enhance inference-time exploration, and Chapter 12 synthesizes prior advances with MCTD-ME, a planning-augmented diffusion framework that unifies masked diffusion, multi-expert learning, RL-based planning, and biophysical principles for scalable protein design.
Together, these contributions advance the theoretical and practical foundations of AI-driven biomolecular design and establish a cohesive framework that unifies generative modeling and active reinforcement learning to address the CURED challenges, enabling controllable, multimodal, robust, efficient, and biophysically grounded design. By bridging reinforcement learning, generative AI, and first-principles biophysics, this foundation paves the way for more robust, diverse, and biologically faithful design systems that can accelerate discovery across drug development, protein engineering, and the broader life sciences.