Statistics and Its Interface

Volume 11 (2018)

Number 3

Robust model-free feature screening based on modified Hoeffding measure for ultra-high dimensional data

Pages: 473 – 489

DOI: https://dx.doi.org/10.4310/SII.2018.v11.n3.a10

Authors

Yuan Yu (School of Statistics, Shandong University of Finance and Economics, Jinan, China; and School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China)

Di He (School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China)

Yong Zhou (Institute of Statistics and Interdisciplinary Sciences and the School of Statistics, Faculty of Economics and Management, East China Normal University, Shanghai, China)

Abstract

Sure independence screening (SIS) has become a cutting-edge dimension reduction technique for extracting important features from ultrahigh-dimensional data in statistical learning. Many screening methods are tailored to specific models that satisfy certain assumptions. With the availability of more data types and more complicated models, a robust model-free procedure that imposes less restrictive conditions on the data is required. In this paper, we propose a modified Hoeffding measure that efficiently characterizes the dependence between two random variables. The modified Hoeffding measure takes values between $0$ and $1$ and, under some mild conditions, equals zero if and only if the two variables are independent. This property enables us to propose a novel feature screening procedure based on the measure without specifying the regression structure. The proposed method is robust to heavy-tailed data and outliers in both the predictors and the response, and it is suitable for complex data, including discrete and multivariate variables. In addition, it can extract important features even when the underlying model is complicated. We further establish the sure screening property and the ranking consistency property, even when the dimensionality is of exponential order in the sample size, without assuming any moment condition on the predictors or the response. Simulations and an analysis of real data demonstrate the versatility and practicability of the proposed method in comparison with other state-of-the-art approaches.
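The general shape of such a marginal screening procedure can be illustrated with a minimal Python sketch. The abstract does not give the formula of the modified Hoeffding measure, so the sketch substitutes a plug-in estimate of the classical Hoeffding measure $\int (F_{XY} - F_X F_Y)^2 \, dF_{XY}$, which is not normalized to lie between $0$ and $1$ and is not the authors' statistic. The function names `hoeffding_plugin` and `screen`, and the choice to keep the top `d` features, are assumptions made for illustration only.

```python
import numpy as np


def hoeffding_plugin(x, y):
    """Plug-in estimate of the classical Hoeffding measure.

    Stand-in only: this is NOT the paper's modified measure, whose
    formula is not stated in the abstract.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # le_x[i, j] is True when x[j] <= x[i]; likewise for le_y.
    le_x = x[None, :] <= x[:, None]
    le_y = y[None, :] <= y[:, None]
    f_x = le_x.mean(axis=1)            # empirical marginal CDF of x at x[i]
    f_y = le_y.mean(axis=1)            # empirical marginal CDF of y at y[i]
    f_xy = (le_x & le_y).mean(axis=1)  # empirical joint CDF at (x[i], y[i])
    # Average squared gap between the joint CDF and the product of marginals.
    return float(np.mean((f_xy - f_x * f_y) ** 2))


def screen(X, y, d):
    """Rank the columns of X by marginal dependence with y; keep the top d.

    Hypothetical helper: the cutoff d is chosen by the user here.
    """
    X = np.asarray(X, dtype=float)
    scores = np.array(
        [hoeffding_plugin(X[:, j], y) for j in range(X.shape[1])]
    )
    keep = np.argsort(scores)[::-1][:d]  # indices sorted by decreasing score
    return keep, scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 50))
    # Nonlinear signal in column 3 plus heavy-tailed (Cauchy) noise.
    y = np.exp(X[:, 3]) + 0.1 * rng.standard_cauchy(200)
    keep, _ = screen(X, y, d=5)
    print(keep)
```

Because the stand-in statistic depends on the data only through order comparisons, it is unchanged by strictly increasing transformations of either variable and requires no moment conditions, which is the kind of robustness the abstract refers to. No model for the relationship between the response and the predictors is specified anywhere in the sketch.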

Keywords

feature screening, Hoeffding measure, ranking consistency property, robustness, sure screening property, ultrahigh-dimensional data

2010 Mathematics Subject Classification

62E99, 62G05, 62G35, 62H20, 62P10

Yu’s work was supported by the Graduate Innovation Foundation of Shanghai University of Finance and Economics, China (2015110758).

He’s work was supported by the Graduate Innovation Foundation of Shanghai University of Finance and Economics, China (CXJJ-2014-452).

Zhou’s work was supported by the State Key Program of the National Natural Science Foundation of China (71331006) and the State Key Program in the Major Research Plan of the National Natural Science Foundation of China (91546202).

Received 7 December 2016

Published 17 September 2018