Statistics and Its Interface

Volume 11 (2018)

Number 2

Sparse Bayesian variable selection for classifying high-dimensional data

Pages: 385 – 395

DOI: https://dx.doi.org/10.4310/SII.2018.v11.n2.a14

Authors

Aijun Yang (College of Economics and Management, Nanjing Forestry University, Jiangsu, China; and Key Laboratory of Statistical Information Technology and Data Mining, State Statistics Bureau, Chengdu, China)

Heng Lian (Department of Mathematics, City University of Hong Kong, Kowloon Tong, Hong Kong)

Xuejun Jiang (Department of Mathematics, South University of Science and Technology of China, Shenzhen, China)

Pengfei Liu (School of Mathematics and Statistics, Jiangsu Normal University, Xuzhou, China)

Abstract

Identifying differentially expressed genes for classifying experimental classes is an important application of microarrays, and methods for selecting the important genes are central to accurate classification. Because the number of genes is large and many of them are irrelevant, insignificant, or redundant, standard statistical methods do not work well, and existing methods must be modified to analyze microarray data better. We present a stochastic variable selection approach for gene selection that uses different two-level hierarchical prior distributions for the regression coefficients. These priors act as a sparsity-enforcing mechanism to perform gene selection for classification. Using MCMC methods to simulate parameters from the posterior distribution, we develop and implement an efficient algorithm. The algorithm is robust to the choice of initial values and produces posterior probabilities of the related genes for biological interpretation. To highlight the potential applications of the proposed approach, we apply it to the well-known colon cancer and leukemia data sets from the microarray literature.
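To make the idea of stochastic variable selection with posterior inclusion probabilities concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a probit classifier with Albert-Chib latent variables and swaps in a plain two-component normal "spike-and-slab" prior in place of the two-level hierarchical sparse priors studied in the paper. The function names (`sample_latent`, `ssvs_probit`), the hyperparameters (`tau0`, `tau1`, `prior_incl`), and the simulated data are all hypothetical and chosen only for illustration.

```python
import numpy as np
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal cdf / inverse cdf from the stdlib


def sample_latent(mean, y, rng):
    """Draw z_i ~ N(mean_i, 1) truncated to z_i > 0 if y_i == 1, else z_i <= 0."""
    z = np.empty_like(mean)
    for i, (m, yi) in enumerate(zip(mean, y)):
        p_neg = STD_NORMAL.cdf(-m)               # P(z_i <= 0) before truncation
        u = rng.uniform()
        q = p_neg + u * (1.0 - p_neg) if yi == 1 else u * p_neg
        q = min(max(q, 1e-12), 1.0 - 1e-12)      # keep inv_cdf away from 0 and 1
        z[i] = m + STD_NORMAL.inv_cdf(q)
    return z


def ssvs_probit(X, y, n_iter=600, burn_in=200,
                tau0=0.1, tau1=3.0, prior_incl=0.1, seed=0):
    """Gibbs sampler returning the posterior inclusion frequency of each column.

    Assumed prior (illustrative only): beta_j ~ N(0, tau1^2) if gamma_j = 1
    (slab), beta_j ~ N(0, tau0^2) if gamma_j = 0 (spike), and
    P(gamma_j = 1) = prior_incl.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    gamma = np.zeros(p, dtype=int)
    incl = np.zeros(p)
    XtX = X.T @ X
    for it in range(n_iter):
        # Step 1: latent Gaussian responses given beta and the class labels.
        z = sample_latent(X @ beta, y, rng)
        # Step 2: beta | z, gamma is Gaussian with precision X'X + D^{-1}.
        prior_var = np.where(gamma == 1, tau1 ** 2, tau0 ** 2)
        cov = np.linalg.inv(XtX + np.diag(1.0 / prior_var))
        cov = (cov + cov.T) / 2.0                # remove round-off asymmetry
        beta = rng.multivariate_normal(cov @ (X.T @ z), cov)
        # Step 3: gamma_j | beta_j is Bernoulli; compare slab and spike densities.
        log_slab = np.log(prior_incl) - np.log(tau1) - 0.5 * (beta / tau1) ** 2
        log_spike = np.log1p(-prior_incl) - np.log(tau0) - 0.5 * (beta / tau0) ** 2
        prob = 1.0 / (1.0 + np.exp(log_spike - log_slab))
        gamma = (rng.uniform(size=p) < prob).astype(int)
        if it >= burn_in:
            incl += gamma
    return incl / (n_iter - burn_in)


# Simulated stand-in for expression data: only the first two "genes" matter.
data_rng = np.random.default_rng(1)
n, p = 200, 20
X = data_rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[:2] = [2.0, -2.0]
y = (X @ true_beta + data_rng.standard_normal(n) > 0).astype(int)

incl_prob = ssvs_probit(X, y)
print(np.round(incl_prob, 2))
```

The returned vector plays the role of the posterior probabilities mentioned above: columns whose indicator is switched on in most retained draws are the candidates for selection. Real microarray data have far more genes than samples, so a practical implementation would need a more economical update for the coefficients than the full matrix inverse used here.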

Keywords

sparse priors, stochastic variable selection, classification, high-dimensional data

Supported by grants from the Natural Science Foundation of China (11501294, 11501261), the China Postdoctoral Science Foundation (2015M580374, 2016T90398), the Natural Science Foundation of Guangdong (2016A030313856), the Jiangsu Qinglan Project (2017), the Open Project Program of the Key Laboratory of Statistical Information Technology and Data Mining (SDL201704), and the Project of Natural Science Research in Jiangsu Province (15KJB110007).

Received 1 March 2014

Published 7 March 2018