Abstract
Large generative models have transformed the AI landscape with their impressive capabilities, yet their black-box nature poses two fundamental challenges: we do not fully understand how they work, and we struggle to control them precisely. This dissertation investigates these dual challenges of interpretability and steering, exploring their interconnection across different model architectures. We first examine text-guided generative models, demonstrating that concepts can be represented as subspaces of the representation space. This framework enables both mechanistic interpretation and precise manipulation of generated content. For transformer-based language models, we improve Reinforcement Learning from Human Feedback (RLHF) by introducing a principled reward transformation method that alleviates reward hacking and enables effective aggregation of multiple reward models. The dissertation then examines a core assumption underlying much work in mechanistic interpretability: do successful edits provide evidence for localization claims? We find empirical evidence refuting this assumption. Following this negative result, we turn to controlled experiments and stress testing as tools for understanding model behavior. Through a case study of role-separation learning in LLMs, we find that models often do not learn in the way we expect, but instead exploit shortcuts in the training data.