Abstract

We discuss problems in which we have only limited access to information about the underlying distribution of the training data, whether because the data are imperfect or because prior knowledge is insufficient.

We first study the binary classification problem in the setting where the label observations are corrupted by noise. We establish that corruption acts as a form of regularization, and we derive precise upper bounds on the estimation error in the presence of corruption. Our results suggest that corrupted data points are beneficial up to a small fraction of the total sample, scaling with the square root of the sample size.

Next, we study the regression problem in the distribution-free setting. We show that there are three regimes for the possibility of meaningful inference, characterized by the "effective support size" of the feature distribution. Our result implies a counterintuitive in-between regime in which we can still expect meaningful inference for a future input even when that input is unlikely to take a value we have observed before.

We also develop distribution-free methods for predictive inference with hierarchically structured datasets. For the special case of i.i.d. repeated measurements, we propose bounding the expected squared conditional miscoverage rate in order to gain better control of conditional coverage, and we extend existing methods to construct distribution-free prediction sets that achieve this bound.
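
To fix ideas on the last criterion, one natural formalization (the notation below is ours, inferred from the phrase "expected squared conditional miscoverage rate"; it is not taken from the thesis) is the following. For a prediction set \widehat{C}(X) built from the training data, write the conditional miscoverage rate as

    \alpha(X) := \mathbb{P}\bigl( Y \notin \widehat{C}(X) \,\big|\, X \bigr),

and require

    \mathbb{E}\bigl[ \alpha(X)^2 \bigr] \le \tau

for a prescribed tolerance \tau. This second-moment bound carries more information than marginal coverage alone: by Jensen's inequality it implies \mathbb{E}[\alpha(X)] \le \sqrt{\tau}, and by Markov's inequality \mathbb{P}(\alpha(X) > t) \le \tau / t^2 for every t > 0, so inputs with badly miscalibrated conditional coverage must be rare.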

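The distribution-free prediction sets mentioned above belong to the conformal-prediction family. As background, here is a minimal runnable sketch of split conformal prediction, the canonical existing method of this kind; the abstract does not specify which method the thesis extends, and everything in this snippet (data, model, variable names) is illustrative rather than drawn from the thesis.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated data: y = 2x + noise.  Generic illustration only.
    n = 200
    X = rng.uniform(-1, 1, size=(n, 1))
    y = 2 * X[:, 0] + rng.normal(scale=0.3, size=n)

    # Split into a proper training set and a calibration set.
    X_tr, y_tr = X[:100], y[:100]
    X_cal, y_cal = X[100:], y[100:]

    # Fit any regression model on the training split (here: least squares).
    A_tr = np.column_stack([X_tr[:, 0], np.ones(len(X_tr))])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)

    def predict(X_):
        return X_[:, 0] * coef[0] + coef[1]

    # Calibrate: absolute residuals on the held-out split.
    scores = np.abs(y_cal - predict(X_cal))
    n_cal = len(scores)
    alpha = 0.1
    # Finite-sample-valid quantile level for marginal 1 - alpha coverage.
    level = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal
    qhat = np.quantile(scores, min(level, 1.0), method="higher")

    # Prediction interval for a new point: marginal coverage >= 1 - alpha
    # holds under exchangeability of calibration and test data.
    x_new = np.array([[0.5]])
    center = predict(x_new)[0]
    print(f"interval: [{center - qhat:.3f}, {center + qhat:.3f}]")

Note that the guarantee here is marginal, i.e., on average over the test input X; this is exactly the weaker notion of validity that the squared-conditional-miscoverage criterion above is meant to strengthen.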