Statistics and Its Interface

Volume 2 (2009)

Number 3

Support vector machines with disease-gene-centric network penalty for high dimensional microarray data

Pages: 257 – 269

DOI: https://dx.doi.org/10.4310/SII.2009.v2.n3.a1

Authors

Wei Pan (Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minn., U.S.A.)

Xiaotong Shen (School of Statistics, University of Minnesota, Minneapolis, Minn., U.S.A.)

Yanni Zhu (Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minn., U.S.A.)

Abstract

With the availability of gene pathways or networks and accumulating knowledge on genes with variants predisposing to diseases (disease genes), we propose a disease-gene-centric support vector machine (DGC-SVM) that directly incorporates these two sources of prior information into building microarray-based classifiers for binary classification. DGC-SVM aims to detect genes that cluster together and around key disease genes in a gene network. To this end, we propose a penalty over suitably defined groups of genes. A hierarchy is imposed on an undirected gene network to facilitate the definition of such gene groups. The proposed DGC-SVM uses the hinge loss penalized by the sum, over groups, of the $L_{\infty}$-norm of each group. Simulation studies show that DGC-SVM not only detects more disease genes along pathways than the standard-SVM and the SVM with an $L_1$-penalty (L1-SVM), but also captures disease genes that potentially affect the outcome only weakly. Two real-data applications demonstrate that DGC-SVM improves gene selection while retaining the predictive performance of the standard-SVM and L1-SVM. The proposed method has the potential to be an effective classification tool for microarray data that encourages selection of genes lying along paths to, or clustering around, known disease genes.
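
As a rough illustration of the objective described in the abstract, the following minimal sketch evaluates a hinge loss plus a penalty equal to the sum of the $L_{\infty}$-norms of the coefficients in each gene group. It is not the authors' implementation: the function name `dgc_svm_objective`, the unweighted sum over groups, the unaveraged hinge loss, the single tuning parameter `lam`, and the toy groups are all assumptions made for illustration. How the groups are derived from the network hierarchy, and any group-specific weights, are not specified in the abstract and are not modeled here.

```python
from typing import Sequence


def dgc_svm_objective(
    X: Sequence[Sequence[float]],   # expression values, one row per sample
    y: Sequence[int],               # class labels coded as +1 / -1
    beta: Sequence[float],          # one coefficient per gene
    beta0: float,                   # intercept
    groups: Sequence[Sequence[int]],  # hypothetical gene groups (index lists)
    lam: float,                     # hypothetical tuning parameter
) -> float:
    """Hinge loss plus lam * sum of L-infinity norms over gene groups."""
    # Hinge loss: sum over samples of max(0, 1 - y_i * f(x_i)).
    hinge = 0.0
    for xi, yi in zip(X, y):
        f = beta0 + sum(b * v for b, v in zip(beta, xi))
        hinge += max(0.0, 1.0 - yi * f)

    # Grouped penalty: the L-infinity norm of a group is the largest
    # absolute coefficient among the genes in that group.
    penalty = sum(max(abs(beta[j]) for j in g) for g in groups)

    return hinge + lam * penalty


# Toy example: 2 samples, 3 genes, overlapping groups (assumed, not from the paper).
X = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
y = [1, -1]
beta = [0.5, -0.5, 0.0]
groups = [[0, 1], [1, 2], [2]]
value = dgc_svm_objective(X, y, beta, 0.0, groups, lam=0.1)
```

Because the $L_{\infty}$-norm of a group depends only on its largest absolute coefficient, a penalty of this form can set a whole group to zero at once, which is consistent with the abstract's stated aim of selecting genes in clusters rather than individually.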

Keywords

DAG, gene expression, gene network, grouped penalty, hierarchy, penalization

Published 1 January 2009