Abstract
This dissertation investigates statistical learning and inference in weakly specified settings, focusing on three key challenges that depart from the idealized assumptions of canonical statistical analyses: distribution-free statistical inference, inference under distribution shifts, and representation learning with unlabeled data.

First, matrix completion is an unsupervised task with a wide range of applications across domains, in which quantifying the uncertainty of imputed entries is of particular interest. Leveraging the framework of weighted conformal prediction, we develop a method that constructs valid predictive intervals for missing entries in a matrix, ensuring robust inference without requiring assumptions on the matrix structure or noise distribution. Our approach provides theoretical guarantees on coverage while remaining agnostic to the choice of completion algorithm, making it broadly applicable in practice.

Second, matrix completion can be viewed as a special case of learning under distribution shift, in which inference on an unsampled subpopulation of interest is based on a sampled one. More broadly, distribution shifts present fundamental challenges in statistical inference, as real-world data often exhibit discrepancies between training and test distributions. Standard methods based on sample reweighting can be sensitive to misspecification, while traditional distributionally robust learning (DRL) techniques may be overly conservative. To address this, we propose a new framework that integrates shape constraints into DRL, leveraging structural properties of the data to refine robustness guarantees. By imposing isotonic constraints on estimated density ratios, our approach mitigates the trade-off between reweighting errors and worst-case risk control, enabling more practical and theoretically sound solutions for learning under distribution shifts.

Finally, motivated by the growing prevalence of unlabeled data in modern machine learning, we investigate the theoretical foundations of contrastive self-supervised learning (SSL). While SSL has demonstrated remarkable empirical success in extracting high-quality representations from high-dimensional data, its theoretical understanding remains incomplete. We provide a rigorous analysis of the statistical properties of representations learned through contrastive objectives, establishing guarantees on their generalization and utility in downstream tasks. Our results reveal key mechanisms underlying the effectiveness of contrastive learning, offering new insights into its advantages in high-dimensional settings.

Together, the contributions of this dissertation advance the methodology and theoretical foundations of statistical learning in settings where traditional assumptions are weakened or absent, offering principled approaches to inference and theoretical guarantees for representation learning in modern data science applications.