We derive the asymptotic distributions of the spiked eigenvalues and eigenvectors under a generalized and unified asymptotic regime, which takes into account the magnitude of spiked eigenvalues, sample size and dimensionality. This regime allows high dimensionality and diverging eigenvalues and provides new insights into the roles that the leading eigenvalues, sample size and dimensionality play in principal component analysis. Our results are a natural extension of those in Statist. Sinica 17 (2007) 1617–1642 to a more general setting and solve the rates of convergence problems in Statist. Sinica 26 (2016) 1747–1770. They also reveal the biases of estimating leading eigenvalues and eigenvectors by using principal component analysis, and lead to a new covariance estimator for the approximate factor model, called Shrinkage Principal Orthogonal complEment Thresholding (S-POET), that corrects the biases. Our results are successfully applied to outstanding problems in the estimation of risks for large portfolios and of false discovery proportions for dependent test statistics, and are illustrated by simulation studies.
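A rough illustration of the eigenvalue bias discussed above: in a spiked covariance model with non-negligible p/n, the leading sample eigenvalue overshoots the population spike by roughly the bulk noise level times p/n, which is the bias a shrinkage correction such as S-POET targets. The simulation and the crude correction below are a minimal sketch under assumed parameter values, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, spike = 200, 400, 50.0          # assumed sample size, dimension and spike

# Population covariance: one spiked direction plus identity noise,
# so the true leading eigenvalue is spike + 1.
v = np.zeros(p); v[0] = 1.0
Sigma = spike * np.outer(v, v) + np.eye(p)

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
sample_cov = X.T @ X / n
lam_hat = np.linalg.eigvalsh(sample_cov)[-1]      # leading sample eigenvalue

# Crude shrinkage correction (hypothetical simplification of the bias formula):
# subtract the average bulk noise level scaled by p/n.
bulk_level = 1.0
lam_shrunk = max(lam_hat - bulk_level * p / n, 0.0)

print(f"true spike {spike + 1:.1f}, sample {lam_hat:.1f}, corrected {lam_shrunk:.1f}")
```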
Challenges of Big Data analysis. Fan, Jianqing; Han, Fang; Liu, Han
National Science Review, 06/2014, Volume 1, Issue 2
Journal Article
Peer-reviewed
Open access
Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promises for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity and measurement errors. These challenges are distinguished and require new computational and statistical paradigms. This paper gives an overview of the salient features of Big Data and how these features drive paradigm changes in statistical and computational methods as well as computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in the high-confidence set and point out that exogenous assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity. They can lead to wrong statistical inferences and consequently wrong scientific conclusions.
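The spurious correlation phenomenon mentioned above can be seen in a few lines of simulation. This is a hedged illustration with assumed sample size and dimensions, not an example from the paper: even when all covariates are independent of the response, the largest marginal sample correlation grows with dimensionality.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50                                            # assumed small sample size
for p in (10, 1_000, 10_000):
    y = rng.standard_normal(n)
    X = rng.standard_normal((n, p))               # covariates independent of y
    yc = (y - y.mean()) / y.std()
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    cors = Xc.T @ yc / n                          # componentwise sample correlations
    print(f"p = {p:6d}: max |corr| = {np.abs(cors).max():.2f}")
```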
Ultrahigh-dimensional variable selection plays an increasingly important role in contemporary scientific discoveries and statistical research. Among others, Fan and Lv [J. R. Stat. Soc. Ser. B Stat. Methodol. 70 (2008) 849–911] proposed an independence screening framework based on ranking the marginal correlations. They showed that the correlation ranking procedure possesses a sure independence screening property within the context of the linear model with Gaussian covariates and responses. In this paper, we propose a more general version of independent learning that ranks the maximum marginal likelihood estimates or the maximum marginal likelihood itself in generalized linear models. We show that the proposed methods, with Fan and Lv [J. R. Stat. Soc. Ser. B Stat. Methodol. 70 (2008) 849–911] as a very special case, also possess the sure screening property with vanishing false selection rate. The conditions under which independence learning possesses the sure screening property are surprisingly simple. This justifies the applicability of such a simple method to a wide spectrum of problems. We quantify explicitly the extent to which the dimensionality can be reduced by independence screening, which depends on the interplay between the covariance matrix of the covariates and the true parameters. Simulation studies are used to illustrate the utility of the proposed approaches. In addition, we establish an exponential inequality for the quasi-maximum likelihood estimator, which is useful for high-dimensional statistical learning.
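A minimal sketch of the marginal ranking idea described above, assuming a logistic model and binary 0/1 responses: each covariate is fit marginally and covariates are ranked by the fitted marginal log-likelihood. The use of scikit-learn's LogisticRegression with a very large C as a stand-in for the unpenalized marginal MLE is an assumption of this sketch, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def marginal_screen(X, y, keep):
    """Rank covariates by marginal log-likelihood and keep the top `keep`."""
    n, p = X.shape
    scores = np.empty(p)
    for j in range(p):
        xj = X[:, [j]]
        # large C makes the fit close to the unpenalized marginal MLE
        model = LogisticRegression(C=1e6).fit(xj, y)
        prob = np.clip(model.predict_proba(xj)[:, 1], 1e-12, 1 - 1e-12)
        scores[j] = np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))
    return np.argsort(scores)[::-1][:keep]       # indices of retained covariates
```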
Noisy matrix completion aims at estimating a low-rank matrix given only partial and corrupted entries. Despite remarkable progress in designing efficient estimation algorithms, it remains largely unclear how to assess the uncertainty of the obtained estimates and how to perform efficient statistical inference on the unknown matrix (e.g., constructing a valid and short confidence interval for an unseen entry). This paper takes a substantial step toward addressing such tasks. We develop a simple procedure to compensate for the bias of the widely used convex and nonconvex estimators. The resulting debiased estimators admit nearly precise nonasymptotic distributional characterizations, which in turn enable optimal construction of confidence intervals/regions for, say, the missing entries and the low-rank factors. Our inferential procedures do not require sample splitting, thus avoiding unnecessary loss of data efficiency. As a byproduct, we obtain a sharp characterization of the estimation accuracy of our debiased estimators in both rate and constant. Our debiased estimators are tractable algorithms that provably achieve full statistical efficiency.
The paper deals with the estimation of a high-dimensional covariance with a conditional sparsity structure and fast diverging eigenvalues. By assuming a sparse error covariance matrix in an approximate factor model, we allow for the presence of some cross-sectional correlation even after taking out common but unobservable factors. We introduce the principal orthogonal complement thresholding method 'POET' to explore such an approximate factor structure with sparsity. The POET estimator includes the sample covariance matrix, the factor-based covariance matrix, the thresholding estimator and the adaptive thresholding estimator as specific examples. We provide mathematical insights when the factor analysis is approximately the same as the principal component analysis for high-dimensional data. The rates of convergence of the sparse residual covariance matrix and the conditional sparse covariance matrix are studied under various norms. It is shown that the effect of estimating the unknown factors vanishes as the dimensionality increases. The uniform rates of convergence for the unobserved factors and their factor loadings are derived. The asymptotic results are also verified by extensive simulation studies. Finally, a real data application on portfolio allocation is presented.
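A minimal sketch of the POET idea described above, assuming a plain soft-thresholding rule with a single threshold tau (the paper also covers adaptive thresholding): keep the top-K principal components of the sample covariance as the factor part and threshold the principal orthogonal complement.

```python
import numpy as np

def poet(X, K, tau):
    """POET-style covariance estimate from an n x p data matrix X (sketch)."""
    S = np.cov(X, rowvar=False)                   # p x p sample covariance
    vals, vecs = np.linalg.eigh(S)                # ascending eigenvalues
    vals, vecs = vals[::-1], vecs[:, ::-1]
    low_rank = (vecs[:, :K] * vals[:K]) @ vecs[:, :K].T   # top-K PC part
    R = S - low_rank                              # principal orthogonal complement
    R_thr = np.sign(R) * np.maximum(np.abs(R) - tau, 0.0) # soft thresholding
    np.fill_diagonal(R_thr, np.diag(R))           # keep the variances untouched
    return low_rank + R_thr
```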
This paper studies the sparsistency and rates of convergence for estimating sparse covariance and precision matrices based on penalized likelihood with nonconvex penalty functions. Here, sparsistency refers to the property that all parameters that are zero are actually estimated as zero with probability tending to one. Depending on the application, sparsity may occur a priori in the covariance matrix, its inverse or its Cholesky decomposition. We study these three sparsity exploration problems under a unified framework with a general penalty function. We show that the rates of convergence for these problems under the Frobenius norm are of order $(s_n\log p_n/n)^{1/2}$, where $s_n$ is the number of nonzero elements, $p_n$ is the size of the covariance matrix and $n$ is the sample size. This explicitly spells out that the contribution of high dimensionality is merely a logarithmic factor. The conditions on the rate at which the tuning parameter $\lambda_n$ goes to 0 are made explicit and compared under different penalties. As a result, for the $L_1$-penalty, to guarantee sparsistency and the optimal rate of convergence, the number of nonzero elements should be small: $s_{n}^{\prime}=O(p_{n})$ at most, among $O(p_{n}^{2})$ parameters, for estimating a sparse covariance or correlation matrix, a sparse precision or inverse correlation matrix or a sparse Cholesky factor, where $s_{n}^{\prime}$ is the number of nonzero elements in the off-diagonal entries. On the other hand, using the SCAD or hard-thresholding penalty functions, there is no such restriction.
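For reference, the SCAD folded concave penalty contrasted with the $L_1$-penalty above can be written down explicitly; the following is a small helper using the piecewise form of Fan and Li with the customary choice a = 3.7.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty value p_lambda(|t|): linear, then quadratic, then flat."""
    t = np.abs(np.asarray(t, dtype=float))
    flat = lam ** 2 * (a + 1) / 2.0                          # constant beyond a*lam
    mid = (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1))
    return np.where(t <= lam, lam * t, np.where(t <= a * lam, mid, flat))
```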
High-frequency financial data allow us to estimate large volatility matrices with a relatively short time horizon. Many novel statistical methods have been introduced to address large volatility matrix estimation problems for a high-dimensional Itô process with microstructural noise contamination. Their asymptotic theories require sub-Gaussian or finite high-order moment assumptions for the observed log-returns. These assumptions are at odds with the heavy-tail phenomenon that is pervasive in financial stock returns, and new procedures are needed to mitigate the influence of heavy tails. In this article, we introduce the Huber loss function with a diverging threshold to develop a robust realized volatility estimator. We show that it has sub-Gaussian concentration around the volatility with only finite fourth moments of the observed log-returns. With the proposed robust estimator as input, we further regularize it by using the principal orthogonal complement thresholding (POET) procedure to estimate the large volatility matrix that admits an approximate factor structure. We establish the asymptotic theories for such low-rank plus sparse matrices. A simulation study is conducted to check the finite sample performance of the proposed estimation methods.
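A hedged sketch of the Huber-type idea described above, applied to a simple mean of squared log-returns as a stand-in for realized volatility; the fixed-point weighting and the heuristic choice of the diverging threshold tau are assumptions of this sketch, not the paper's estimator.

```python
import numpy as np

def huber_mean(x, tau, n_iter=50):
    """Huber M-estimate of the mean of x with truncation level tau."""
    mu = np.median(x)
    for _ in range(n_iter):                       # fixed-point (IRLS) iteration
        r = x - mu
        w = np.minimum(1.0, tau / np.maximum(np.abs(r), 1e-12))   # Huber weights
        mu = np.sum(w * x) / np.sum(w)
    return mu

# Example: robust volatility proxy from heavy-tailed log-returns.
rng = np.random.default_rng(2)
returns = rng.standard_t(df=4, size=5_000) * 0.01
tau = np.std(returns ** 2) * (len(returns) ** 0.25)   # heuristic diverging threshold
print(huber_mean(returns ** 2, tau))
```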
Folded concave penalization methods have been shown to enjoy the strong oracle property for high-dimensional sparse estimation. However, a folded concave penalization problem usually has multiple local solutions and the oracle property is established only for one of the unknown local solutions. A challenging fundamental issue remains: it is not clear whether the local optimum computed by a given optimization algorithm possesses those nice theoretical properties. To close this important theoretical gap, open for over a decade, we provide a unified theory showing explicitly how to obtain the oracle solution via the local linear approximation algorithm. For a folded concave penalized estimation problem, we show that as long as the problem is localizable and the oracle estimator is well behaved, we can obtain the oracle estimator by using the one-step local linear approximation. In addition, once the oracle estimator is obtained, the local linear approximation algorithm converges, namely it produces the same estimator in the next iteration. The general theory is demonstrated using four classical sparse estimation problems, that is, sparse linear regression, sparse logistic regression, sparse precision matrix estimation and sparse quantile regression.
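A minimal sketch of the one-step local linear approximation for a SCAD-penalized linear regression: linearize the folded concave penalty at an initial estimate, which turns the problem into a weighted lasso. The reparametrization trick and the use of scikit-learn's Lasso are choices of this sketch, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def scad_deriv(t, lam, a=3.7):
    """Derivative of the SCAD penalty at |t| (Fan and Li, a = 3.7)."""
    t = np.abs(t)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1))

def one_step_lla(X, y, beta_init, lam):
    # Linearizing the penalty at beta_init gives per-coefficient weights
    # w_j = p'_lambda(|beta_init_j|), i.e. a weighted-L1 (adaptive lasso) problem.
    w = np.maximum(scad_deriv(beta_init, lam), 1e-6)   # floor avoids division by zero
    Xw = X / w                                          # reparametrize gamma_j = w_j * beta_j
    # With unit L1 weight on gamma this matches (1/2n)||y - X beta||^2 + sum_j w_j |beta_j|
    fit = Lasso(alpha=1.0, fit_intercept=False).fit(Xw, y)
    return fit.coef_ / w                                # map gamma back to beta
```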
The variance-covariance matrix plays a central role in the inferential theories of high-dimensional factor models in finance and economics. Popular regularization methods that directly exploit sparsity are not applicable to many financial problems. Classical methods of estimating covariance matrices are based on strict factor models, which assume independent idiosyncratic components. This assumption, however, is restrictive in practical applications. By assuming a sparse error covariance matrix, we allow for the presence of cross-sectional correlation even after taking out common factors, which enables us to combine the merits of both methods. We estimate the sparse covariance using the adaptive thresholding technique as in Cai and Liu [J. Amer. Statist. Assoc. 106 (2011) 672-684], taking into account the fact that direct observations of the idiosyncratic components are unavailable. The impact of high dimensionality on covariance matrix estimation based on the factor structure is then studied.
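A minimal sketch of entry-wise adaptive thresholding in the spirit of Cai and Liu (2011), applied to the residuals left after removing estimated common factors; the constant c and the use of plain soft thresholding are assumptions of this sketch, not the paper's exact procedure.

```python
import numpy as np

def adaptive_soft_threshold(residuals, c=2.0):
    """Adaptive thresholding of the residual covariance; residuals is n x p."""
    n, p = residuals.shape
    U = residuals.T @ residuals / n                           # residual sample covariance
    products = residuals[:, :, None] * residuals[:, None, :]  # u_ti * u_tj, shape (n, p, p)
    theta = products.var(axis=0)                              # entrywise estimate of Var(u_i u_j)
    tau = c * np.sqrt(theta * np.log(p) / n)                  # entry-specific thresholds
    out = np.sign(U) * np.maximum(np.abs(U) - tau, 0.0)       # soft-threshold off-diagonals
    np.fill_diagonal(out, np.diag(U))                         # keep the variances untouched
    return out
```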
Variable selection plays an important role in high-dimensional statistical modelling, which nowadays appears in many areas and is key to various scientific discoveries. For problems of large scale or dimensionality p, accuracy of estimation and computational cost are two top concerns. Recently, Candes and Tao proposed the Dantzig selector using L₁-regularization and showed that it achieves the ideal risk up to a logarithmic factor log(p). Their innovative procedure and remarkable result are challenged when the dimensionality is ultrahigh, as the factor log(p) can be large and their uniform uncertainty principle can fail. Motivated by these concerns, we introduce the concept of sure screening and propose a sure screening method that is based on correlation learning, called sure independence screening, to reduce dimensionality from high to a moderate scale that is below the sample size. In a fairly general asymptotic framework, correlation learning is shown to have the sure screening property even for exponentially growing dimensionality. As a methodological extension, iterative sure independence screening is also proposed to enhance its finite sample performance. With dimension reduced accurately from high to below the sample size, variable selection can be improved in both speed and accuracy, and can then be accomplished by a well-developed method such as smoothly clipped absolute deviation, the Dantzig selector, lasso or adaptive lasso. The connections between these penalized least squares methods are also elucidated.
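A minimal sketch of the correlation-learning step of sure independence screening: standardize, compute componentwise correlations with the response, and keep the top d covariates (d below the sample size) before applying SCAD, the Dantzig selector or the lasso to the retained set. The choice of d is an assumption left to the user.

```python
import numpy as np

def sis(X, y, d):
    """Return indices of the d covariates with largest |marginal correlation|."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = (y - y.mean()) / y.std()
    marginal_corr = Xc.T @ yc / len(y)            # componentwise sample correlations
    return np.argsort(np.abs(marginal_corr))[::-1][:d]

# Example: screen down to d = n - 1 covariates, then run a penalized fit
# (e.g. SCAD or lasso) on X[:, sis(X, y, d)].
```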