Statistics and Its Interface

Volume 10 (2017)

Number 2

REC: fast sparse regression-based multicategory classification

Pages: 175 – 185

DOI: https://dx.doi.org/10.4310/SII.2017.v10.n2.a2

Authors

Chong Zhang (Department of Statistics and Actuarial Science, University of Waterloo, Ontario, Canada)

Xiaoling Lu (Center for Applied Statistics, School of Statistics, Renmin University of China)

Zhengyuan Zhu (Department of Statistics, Iowa State University, Ames, Ia., U.S.A.)

Yin Hu (Sage Bionetworks, U.S.A.)

Darshan Singh (Department of Computer Science, University of North Carolina, Chapel Hill, N.C., U.S.A.)

Corbin Jones (Department of Computer Science, University of North Carolina, Chapel Hill, N.C., U.S.A.)

Jinze Liu (Department of Computer Science, University of Kentucky, Lexington, Ky., U.S.A.)

Jan F. Prins (Department of Computer Science, University of North Carolina, Chapel Hill, N.C., U.S.A.)

Yufeng Liu (Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, N.C., U.S.A.)

Abstract

Recent advances in technology enable researchers to gather and store enormous data sets with ultra-high dimensionality. In bioinformatics, microarray and next generation sequencing technologies can produce data with tens of thousands of predictors of biomarkers. On the other hand, the corresponding sample sizes are often limited. For classification problems, to predict new observations with high accuracy and to better understand the effect of predictors on classification, it is desirable, and often necessary, to train the classifier with variable selection. In the literature, sparse regularized classification techniques have been popular because they perform classification and variable selection simultaneously. Despite their success, such sparse penalized methods can be computationally slow when the dimension of the problem is ultra high. To overcome this challenge, we propose a new sparse REgression based multicategory Classifier (REC). Our method uses a simplex to represent the different categories of the classification problem. A major advantage of REC is that the optimization can be decoupled into smaller independent sparse penalized regression problems, and hence solved using parallel computing. Consequently, REC enjoys extraordinarily fast computation. Moreover, REC is able to provide class conditional probability estimation. Simulated examples and applications to microarray and next generation sequencing data suggest that REC is very competitive when compared to several existing methods.
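To make the simplex idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes one common construction of a regular simplex with k unit-norm vertices in R^(k-1), and it assumes a nearest-vertex (largest inner product) decision rule; the abstract does not specify either. All function names (`simplex_vertices`, `encode_labels`, `response_columns`, `decode`) are hypothetical. The sketch shows why the fit can decouple: once each class label is replaced by its vertex, the response has k-1 coordinates, and each coordinate can be handed to its own sparse penalized regression, possibly in parallel.

```python
import math


def simplex_vertices(k):
    """Return k unit vectors in R^(k-1) forming a regular simplex centered at 0.

    Assumed construction (not taken from the abstract): every pair of
    vertices has inner product -1/(k-1).
    """
    if k < 2:
        raise ValueError("need at least two classes")
    d = k - 1
    first = [1.0 / math.sqrt(d)] * d
    shift = -(1.0 + math.sqrt(k)) / d ** 1.5
    scale = math.sqrt(k / d)
    vertices = [first]
    for j in range(d):
        v = [shift] * d
        v[j] += scale
        vertices.append(v)
    return vertices


def encode_labels(labels, vertices):
    """Replace each class label (0..k-1) by its simplex vertex."""
    return [vertices[y] for y in labels]


def response_columns(encoded):
    """Split the n x (k-1) response into k-1 separate columns.

    Each column could serve as the target of an independent sparse
    regression (for example a LASSO fit), which is what would allow
    the fits to run in parallel.
    """
    return [list(col) for col in zip(*encoded)]


def decode(fitted, vertices):
    """Assumed decision rule: pick the class whose vertex has the largest
    inner product with the fitted (k-1)-dimensional response."""
    scores = [sum(f * w for f, w in zip(fitted, v)) for v in vertices]
    return max(range(len(vertices)), key=scores.__getitem__)


if __name__ == "__main__":
    verts = simplex_vertices(3)
    targets = response_columns(encode_labels([0, 2, 1, 1], verts))
    print(len(targets), "independent regression targets")
    print("decoded class:", decode([0.2, -0.8], verts))
```

With k classes there are only k-1 regression targets regardless of the number of predictors, so the parallel work would scale with the number of classes rather than with the dimension of the data.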

Keywords

LASSO, parallel computing, probability estimation, simplex, variable selection

Published 31 October 2016