Abstract

In this thesis, we propose a new paradigm for constructing generative models, rethinking the conventional framework for image generation and representation learning. Our approach centers on a domain-specific architecture that enables unified, unsupervised image generation and representation learning. Its key component is a carefully engineered bottleneck data structure, designed around the requirements of the task, the characteristics of the data, and the computational constraints of the problem. This bottleneck is pivotal: it drives a learning process that produces useful outputs without any direct supervision. This stands in contrast to traditional methodologies, which train large-scale foundation models in a self-supervised manner and then fine-tune them on annotated data for specific downstream tasks. Our method eliminates the need for such fine-tuning and requires no annotated data at any stage of pre-training. To demonstrate the effectiveness and robustness of the design, we validate it across a variety of challenging tasks in two experimental settings.

In the first setting, we develop a neural network architecture which, trained in an unsupervised manner as a denoising diffusion model, simultaneously learns to both generate and segment images. Learning is driven entirely by the denoising diffusion objective, without any annotation or prior knowledge about regions during training. A computational bottleneck, built into the neural architecture, encourages the denoising network to partition an input into regions, denoise the regions in parallel, and combine the results. The trained model generates synthetic images and, by simple examination of its internal predicted partitions, a semantic segmentation of those images. Without any fine-tuning, we directly apply the unsupervised model to the downstream task of segmenting real images by noising and subsequently denoising them.

In the second setting, we cast multiview reconstruction from unknown pose as a generative modeling problem. From a collection of unannotated 2D images of a scene, our approach simultaneously learns both a network that predicts camera pose from 2D image input and the parameters of a Neural Radiance Field (NeRF) for the 3D scene. To drive learning, we wrap both the pose prediction network and the NeRF inside a Denoising Diffusion Probabilistic Model (DDPM) and train the system via the standard denoising objective. The framework requires the system to denoise an input 2D image by predicting its pose and rendering the NeRF from that pose; learning to denoise thus forces the system to concurrently learn the underlying 3D NeRF representation and a mapping from images to camera extrinsic parameters. To facilitate the latter, we design a custom network architecture that represents pose as a distribution, granting implicit capacity for discovering view correspondences when trained end-to-end for denoising alone.
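To make the first setting's mechanism concrete, below is a minimal PyTorch sketch of a region-factorized denoiser. It is an illustration under simplifying assumptions, not the thesis's implementation: the names (FactorizedDenoiser, mask_net, region_denoiser), the network depths, and the omission of timestep conditioning are all hypothetical choices made for brevity.

```python
import torch
import torch.nn as nn

class FactorizedDenoiser(nn.Module):
    """Minimal sketch: partition the noisy input into regions, denoise
    each region in parallel, recombine. Timestep conditioning omitted."""

    def __init__(self, num_regions: int = 4, channels: int = 3):
        super().__init__()
        # Predicts K soft region masks from the noisy input; the softmax
        # over regions acts as the computational bottleneck that forces
        # the network to commit to a partition of the image.
        self.mask_net = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, num_regions, 3, padding=1),
        )
        # A single denoiser shared across regions; each region is denoised
        # given only its own masked content.
        self.region_denoiser = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, x_noisy: torch.Tensor):
        # Soft partition: mask values sum to 1 across regions at each pixel.
        masks = self.mask_net(x_noisy).softmax(dim=1)          # (B, K, H, W)
        # Denoise each masked region in parallel, then recombine.
        per_region = torch.stack(
            [self.region_denoiser(x_noisy * masks[:, k : k + 1])
             for k in range(masks.shape[1])],
            dim=1,
        )                                                      # (B, K, C, H, W)
        noise_pred = (masks.unsqueeze(2) * per_region).sum(dim=1)
        # Only noise_pred enters the diffusion loss; the masks are a
        # by-product that can be read out as a segmentation.
        return noise_pred, masks
```

Under this sketch, segmenting a real image would amount to adding noise at a moderate timestep, running the forward pass, and taking masks.argmax(dim=1) as the predicted partition, mirroring the noise-then-denoise transfer described above.

For the second setting, a similarly minimal sketch shows how a pose distribution and a scene model can be composed into a single denoiser trained only with the DDPM objective. This one is a deliberate caricature: a real system would ray-march a shared radiance field from continuous camera extrinsics, whereas here the scene is reduced to one learnable image per candidate pose; all names and shapes are assumptions.

```python
import torch
import torch.nn as nn

class PoseAsDistributionDenoiser(nn.Module):
    """Sketch: denoise a 2D view by (a) predicting a distribution over a
    fixed set of candidate camera poses and (b) mixing the renders from
    those poses. Everything here is an illustrative stand-in."""

    def __init__(self, num_candidate_poses: int = 64, channels: int = 3):
        super().__init__()
        # Encoder mapping a noisy view to logits over candidate poses.
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, num_candidate_poses),
        )
        # Stand-in for the NeRF: one learnable 32x32 render per candidate
        # pose. A real system would render a shared 3D radiance field.
        self.scene = nn.Parameter(
            torch.randn(num_candidate_poses, channels, 32, 32))

    def forward(self, x_noisy: torch.Tensor) -> torch.Tensor:
        # Soft pose distribution keeps correspondence discovery differentiable.
        pose_weights = self.encoder(x_noisy).softmax(dim=-1)   # (B, P)
        # Mix the candidate-pose renders by their probabilities.
        x0_pred = torch.einsum("bp,pchw->bchw", pose_weights, self.scene)
        # Trained solely against the standard denoising objective.
        return x0_pred
```

The soft mixture over candidate poses is the point of the design: gradients from the denoising loss flow to every candidate view in proportion to its probability, so the system can discover which views correspond without ever being told a pose.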
This pose-as-distribution design allows our system to successfully build NeRFs, without pose knowledge, for challenging scenes where competing methods fail. At the conclusion of training, the learned NeRF can be extracted and used as a 3D scene model; the full system can sample novel camera poses and generate novel-view images.

Extensive experiments demonstrate the capability of the proposed factorized architecture, and its integral structured computational bottleneck, to address classical challenges in computer vision end-to-end, purely by learning to generate from unlabeled data. The evaluations test the model's versatility and robustness across a broad range of conditions, with no dependency on labeled datasets. Specifically, the results show that the model accomplishes accurate unsupervised image segmentation while also generating high-quality synthetic images, consistently across multiple datasets. The ability to segment images without supervision is particularly noteworthy, as it marks a significant step in the capacity of generative models to interpret complex visual data autonomously. Moreover, our research is potentially the first to successfully tackle unsupervised pose estimation and 3D reconstruction within a diffusion-based framework for 360-degree scenes. This addresses a long-standing challenge in computer vision: achieving reliable 3D understanding from 2D inputs in an unsupervised manner. Our approach estimates pose and reconstructs the 3D geometry of the scene without prior knowledge or external annotations, paving the way for applications in virtual reality, augmented reality, and robotic navigation. These findings validate the efficacy of the proposed generative architecture and underscore its potential to advance unsupervised learning in computer vision, opening avenues for research and application previously thought to depend on extensive labeled data.
