Statistics and Its Interface

Volume 9 (2016)

Number 4

Special Issue on Statistical and Computational Theory and Methodology for Big Data

Guest Editors: Ming-Hui Chen (University of Connecticut); Radu V. Craiu (University of Toronto); Faming Liang (University of Florida); and Chuanhai Liu (Purdue University)

Model diagnostics in reduced-rank estimation

Pages: 469 – 484

DOI: https://dx.doi.org/10.4310/SII.2016.v9.n4.a7

Author

Kun Chen (Department of Statistics, University of Connecticut, Storrs, CT, U.S.A.)

Abstract

Reduced-rank methods are widely used in high-dimensional multivariate analysis for conducting simultaneous dimension reduction and model estimation. However, the commonly used reduced-rank methods are not robust, as the underlying reduced-rank structure can be easily distorted by only a few data outliers. Anomalies are bound to exist in big data problems, and in some applications they themselves could be of primary interest. While naive residual analysis is often inadequate for outlier detection due to potential masking and swamping, robust reduced-rank estimation approaches could be computationally demanding. Under Stein’s unbiased risk estimation framework, we propose a set of tools, including leverage score and generalized information score, to perform model diagnostics and outlier detection in large-scale reduced-rank estimation. The leverage scores give an exact decomposition of the so-called model degrees of freedom to the observation level, which leads to exact decompositions of many commonly used information criteria; the resulting quantities are thus named information scores of the observations. The proposed information score approach provides a principled way of combining the residuals and leverage scores for anomaly detection. Simulation studies confirm that the proposed diagnostic tools work well. A pattern recognition example with handwritten digit images and a time series analysis example with monthly U.S. macroeconomic data further demonstrate the efficacy of the proposed approaches.
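For orientation only, the setting described in the abstract can be sketched in a few lines of NumPy. This is not the paper's method: it fits the classical reduced-rank regression estimator (identity weighting) by projecting the least-squares fit onto its leading singular directions, then computes the two naive ingredients the abstract mentions, per-observation residual norms and ordinary least-squares hat-matrix leverages. The SURE-based leverage and information scores proposed in the paper are defined differently and are not reproduced here. The function names `reduced_rank_fit` and `ols_leverage`, the simulated data, and the planted outlier are all assumptions made for illustration.

```python
import numpy as np


def reduced_rank_fit(X, Y, rank):
    """Classical reduced-rank regression with identity weighting (illustrative)."""
    # Ordinary least-squares coefficient matrix, shape (p, q).
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    # Right singular vectors of the OLS fitted values give the response directions.
    _, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
    V = Vt[:rank].T
    # Project the OLS coefficients onto the leading `rank` response directions.
    return B_ols @ V @ V.T


def ols_leverage(X):
    """Diagonal of the OLS hat matrix X (X'X)^{-1} X' (not the paper's leverage score)."""
    Q, _ = np.linalg.qr(X)
    return np.sum(Q**2, axis=1)


# Simulated data (hypothetical): n observations, p predictors, q responses, true rank r.
rng = np.random.default_rng(0)
n, p, q, r = 100, 6, 5, 2
X = rng.standard_normal((n, p))
B_true = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))
Y = X @ B_true + 0.1 * rng.standard_normal((n, q))
Y[0] += 10.0  # plant a single gross outlier in the first observation

B_hat = reduced_rank_fit(X, Y, r)
resid_norm = np.linalg.norm(Y - X @ B_hat, axis=1)  # naive residual size per observation
lev = ols_leverage(X)  # sums to p, the OLS degrees of freedom per response
```

In this toy setup a single planted outlier stands out in `resid_norm`, but with several outliers such naive residuals can suffer from the masking and swamping that the abstract warns about, which is the motivation for combining residuals with leverage in a principled score.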

Keywords

big data, information score, model diagnostics, multivariate regression, outlier detection, reduced-rank estimation

2010 Mathematics Subject Classification

Primary 62M10. Secondary 62J12.

Published 14 September 2016