Abstract

Generative modeling and representation learning are core pillars of modern machine learning and computer vision. In recent years, the field has progressed from analyzing existing visual data to building generative models that can synthesize realistic and diverse visual content. These models offer not only powerful tools for content creation but also a unique perspective on visual understanding: by learning to reconstruct visual structures, they reveal how patterns can be captured, organized, and computationally represented. This thesis investigates the bidirectional relationship between generative modeling and representation learning through two complementary perspectives.

The first part of the thesis focuses on enhancing the representation learning capabilities of generative models. We begin by identifying a key limitation of standard architectural designs: residual connections in generative models tend to favor high-rank features, which biases learning toward low-level textures rather than semantically meaningful abstractions. To address this, we introduce a decayed residual connection that penalizes the contribution of skip connections, effectively encouraging the model to learn compact, low-rank representations. This design significantly improves both representation quality and generative performance in masked autoencoders and diffusion-based models. However, although diffusion models inherently learn useful representations, obtaining a compact and coherent low-dimensional embedding remains difficult because the representation is distributed across multiple noise levels and layers. Inspired by classical spectral methods, we propose an efficient distributed spectral clustering algorithm that aggregates features from different stages of the model to form a compact, semantically rich embedding. We further extend our analysis to the generative adversarial network (GAN) framework. Observing that GAN discriminators often learn meaningful features, we introduce a representation-aware learning objective together with a capacity-preserving regularization technique. This approach improves the quality of the features learned by the discriminator, making them useful for downstream semantic tasks.

The second part of the thesis examines how learned representations can be used to improve the quality of generation. We develop a hierarchical generative model that operates in a cascade of semantic spaces, ranging from global structure to fine-grained details, extracted from a pretrained visual encoder. A set of diffusion models is trained to sequentially reconstruct these semantic features using denoising objectives. We demonstrate that a semantics-aware latent representation, such as a 256-dimensional vector from a CLIP encoder, achieves a significantly higher compression ratio than traditional VAE latents while preserving almost all of the visual information in a 256×256 image. This architecture not only improves sample quality but also accelerates training and outperforms larger models trained on more data. Finally, we explore how physics-informed representations can further enhance generative capabilities. By incorporating an autoencoder whose latent bottleneck is designed to reflect physical properties, specifically intrinsic reflectance and lighting, we enable the model to disentangle and manipulate scene properties. This allows for unsupervised generation of albedo maps and realistic image relighting.
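
To make the decayed residual connection from the first part concrete, the following is a minimal sketch assuming the simplest possible instantiation: a single scalar decay factor that attenuates the skip path of an otherwise standard residual block. The block structure, decay value, and placement are illustrative assumptions; the thesis's exact formulation and decay schedule may differ.

```python
import torch
import torch.nn as nn

class DecayedResidualBlock(nn.Module):
    """Residual block whose skip connection is scaled down by a decay
    factor, so the transformed branch (rather than the raw input) must
    carry most of the signal. Illustrative only; the actual design in
    the thesis may use a different parameterization or schedule."""

    def __init__(self, dim: int, decay: float = 0.5):
        super().__init__()
        self.decay = decay  # a value < 1.0 penalizes the skip path
        self.body = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # A standard residual block would return x + self.body(x);
        # here the identity path is attenuated instead of passed through as-is.
        return self.decay * x + self.body(x)


# Usage: drop-in replacement for a standard residual/MLP block in an encoder.
block = DecayedResidualBlock(dim=768, decay=0.5)
out = block(torch.randn(4, 196, 768))
```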
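
The role of the spectral embedding over diffusion features can be illustrated with ordinary (non-distributed) spectral clustering applied to features concatenated from several hypothetical stages of a diffusion model. The stage dimensions, sample count, and use of scikit-learn's SpectralClustering are assumptions for illustration; the thesis's distributed algorithm is not reproduced here.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Hypothetical per-sample features collected from different layers /
# noise levels of a diffusion model (random placeholders here).
features_per_stage = [np.random.randn(500, d) for d in (128, 256, 512)]

# Aggregate by concatenating each sample's features from every stage.
aggregated = np.concatenate(features_per_stage, axis=1)

# Standard spectral clustering on the aggregated features, shown only to
# indicate what the resulting compact embedding is used for.
labels = SpectralClustering(
    n_clusters=10, affinity="nearest_neighbors", random_state=0
).fit_predict(aggregated)
```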
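
The compression claim in the second part can be made explicit with a small back-of-the-envelope calculation. The image and CLIP-vector sizes follow directly from the abstract; the 32×32×4 VAE latent grid used as a reference point is an assumption, not a detail stated above.

```python
# Rough compression arithmetic for the semantic latent described above.
image_values = 256 * 256 * 3      # 196,608 values in a 256x256 RGB image
clip_latent = 256                 # 256-dimensional CLIP embedding
vae_latent = 32 * 32 * 4          # assumed: a typical 32x32x4 VAE latent grid

print(image_values / clip_latent)  # ~768x reduction relative to raw pixels
print(image_values / vae_latent)   # ~48x reduction for the assumed VAE latent
```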
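
The physics-informed bottleneck mentioned at the end can be illustrated with the classical intrinsic-image assumption that an image factors into reflectance (albedo) and shading induced by lighting. The toy autoencoder below encodes an image into two latent codes and reconstructs it as the product of a decoded albedo map and a decoded shading map; the two-head architecture and multiplicative composition are illustrative assumptions, not the thesis's actual model.

```python
import torch
import torch.nn as nn

class IntrinsicAutoencoder(nn.Module):
    """Toy autoencoder whose bottleneck separates reflectance and lighting.
    Reconstruction assumes the intrinsic-image model: image = albedo * shading.
    Purely illustrative; the thesis's architecture is not specified here."""

    def __init__(self, latent_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 2 * latent_dim),  # two latent codes
        )

        def make_decoder(out_channels: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Linear(latent_dim, 64 * 16 * 16),
                nn.Unflatten(1, (64, 16, 16)),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, out_channels, 4, stride=2, padding=1),
                nn.Sigmoid(),
            )

        self.albedo_decoder = make_decoder(3)    # intrinsic reflectance (RGB)
        self.shading_decoder = make_decoder(1)   # grayscale lighting/shading

    def forward(self, x: torch.Tensor):
        z = self.encoder(x)
        z_albedo, z_light = z.chunk(2, dim=1)
        albedo = self.albedo_decoder(z_albedo)
        shading = self.shading_decoder(z_light)
        # Relighting amounts to editing or swapping z_light before decoding.
        return albedo * shading, albedo, shading


model = IntrinsicAutoencoder()
recon, albedo, shading = model(torch.rand(2, 3, 64, 64))
```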
