Abstract
High-dimensional statistics focuses on data whose ambient dimension is extremely large relative to the number of data points. In such regimes, classical asymptotic theory and traditional estimators often break down. This thesis proposes and analyzes new estimation procedures for two statistical settings in which the number of parameters to be estimated is large but special sparsity structures exist and can be exploited.

Part I of this thesis studies estimation of the topic-word matrix under the probabilistic Latent Semantic Indexing (pLSI) model. Here, we assume that the ordered entries of each column of the topic-word matrix decay rapidly to zero; this assumption is partly motivated by the empirical observation that word frequencies in text often follow Zipf's law. We introduce a new spectral estimation procedure that thresholds words based on their corpus frequencies, and we show that its ℓ1 error rate depends on the vocabulary size p only through a logarithmic factor. The vocabulary size p is typically very large in practice, and prior work has not adequately addressed this high-dimensional regime.

Part II of this thesis studies regression problems in which the feature vectors are indexed by the vertices of a given graph. We propose a generalization of the Elastic Net penalty that is applicable when the true signal is believed to be smooth or piecewise constant with respect to the given graph. We derive estimation and prediction error bounds under the assumptions of correlated Gaussian design and network alignment, and we also provide a coordinate descent procedure based on the Lagrange dual objective to make computation tractable for large-scale problems.
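As a reading aid for Part I, the following LaTeX sketch states one standard formulation of the pLSI model together with a Zipf-type column-decay condition of the kind the abstract alludes to. The notation (A, W, N, K, C, α) is illustrative and need not match the thesis's own.

% pLSI: each of the n documents is a length-N multinomial draw whose
% word-probability vector mixes K topics.
\[
  X_{\cdot j} \sim \mathrm{Multinomial}\bigl(N,\; A\,W_{\cdot j}\bigr),
  \qquad j = 1, \dots, n,
\]
% where A \in \mathbb{R}^{p \times K} is the topic-word matrix (each column
% sums to one) and W \in \mathbb{R}^{K \times n} holds the topic weights of
% each document. The decay assumption says the sorted entries of every
% column of A fall off rapidly, e.g. at a Zipf-type polynomial rate:
\[
  A_{(k),\ell} \le C\, k^{-\alpha} \quad \text{for all } k,
  \qquad \alpha > 1,
\]
% where A_{(k),\ell} denotes the k-th largest entry of column \ell of A.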
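For Part II, one plausible instance of a graph-aware Elastic Net is sketched below, purely for illustration; the penalty's exact form, the tuning parameters λ1 and λ2, and the use of the graph Laplacian L are assumptions here, not details quoted from the thesis.

\[
  \hat\beta \;=\; \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p}\;
    \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
    \;+\; \lambda_1 \lVert \beta \rVert_1
    \;+\; \lambda_2\, \beta^{\top} L\, \beta,
\]
% L = D - W is the Laplacian of the given graph, so the quadratic term
% expands as \beta^\top L \beta = \sum_{(i,j) \in E} w_{ij} (\beta_i - \beta_j)^2,
% which penalizes disagreement across edges and favors signals that are
% smooth over the graph.

Taking L to be the identity recovers the classical Elastic Net, while replacing the quadratic term with the graph total variation \(\sum_{(i,j) \in E} w_{ij} \lvert \beta_i - \beta_j \rvert\) would instead favor piecewise-constant signals, matching the second case mentioned in the abstract.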