Statistics and Its Interface
Volume 14 (2021)
Number 3
Residual-based tree for clustered binary data
Pages: 295 – 308
DOI: https://dx.doi.org/10.4310/20-SII638
Authors
Abstract
Tree-based methods are widely used for classification in health sciences research, where data are often clustered. In this paper, we propose a variant of the standard classification and regression tree (CART) paradigm to handle clustered binary outcomes. Using residuals from a null generalized linear mixed model as the response, we build a regression tree that partitions the covariate space into rectangles. This avoids modeling the correlation structure explicitly while still accounting for the cluster-correlated design, which allows us to adopt the standard CART machinery for tree growing, pruning, and cross-validation. The class prediction for each terminal node in the final tree is based on the estimated success probability within that node. Our method also extends easily to ensembles of trees and random forests. Using extensive simulations, we compare our residual-based trees to the standard classification tree. Finally, the methods are illustrated using data from a study of kidney cancer and a study of surgical mortality after colectomy.
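The pipeline described in the abstract can be illustrated with a rough sketch. It is not the authors' implementation. As a loudly labeled assumption, the null generalized linear mixed model is replaced by a crude stand-in: each cluster's success probability is estimated as a cluster mean shrunk toward the overall mean, and the residual is the outcome minus that estimate. The function names, the `prior_strength` parameter, the stopping rules, the 0.5 classification cutoff, and the toy data are all hypothetical. Pruning and cross-validation are omitted.

```python
from collections import defaultdict

def null_residuals(y, cluster, prior_strength=2.0):
    """Stand-in for null-GLMM residuals (an assumption, not the paper's model):
    outcome minus a cluster mean shrunk toward the overall mean."""
    overall = sum(y) / len(y)
    total, count = defaultdict(float), defaultdict(int)
    for yi, c in zip(y, cluster):
        total[c] += yi
        count[c] += 1
    p = {c: (total[c] + prior_strength * overall) / (count[c] + prior_strength)
         for c in count}
    return [yi - p[c] for yi, c in zip(y, cluster)]

def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(X, r, idx, min_leaf):
    """Exhaustive search for the axis-aligned split minimizing residual SSE."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({X[i][j] for i in idx})[:-1]:
            left = [i for i in idx if X[i][j] <= t]
            right = [i for i in idx if X[i][j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = sse([r[i] for i in left]) + sse([r[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, j, t, left, right)
    return best

def grow(X, y, r, idx, depth=0, max_depth=2, min_leaf=2):
    """Grow a regression tree on residuals r; leaves store the observed
    success proportion of y and the class label it implies."""
    split = best_split(X, r, idx, min_leaf) if depth < max_depth else None
    parent_sse = sse([r[i] for i in idx])
    if split is None or parent_sse - split[0] <= 1e-12:
        prob = sum(y[i] for i in idx) / len(idx)
        return {"leaf": True, "prob": prob, "label": int(prob >= 0.5)}
    _, j, t, left, right = split
    return {"leaf": False, "feature": j, "threshold": t,
            "left": grow(X, y, r, left, depth + 1, max_depth, min_leaf),
            "right": grow(X, y, r, right, depth + 1, max_depth, min_leaf)}

def predict(tree, x):
    """Route one covariate vector to its leaf and return the class label."""
    while not tree["leaf"]:
        side = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree["label"]

# Hypothetical toy data: three clusters of four subjects, one covariate.
X = [[0.1], [0.2], [0.8], [0.9], [0.15], [0.3],
     [0.7], [0.95], [0.25], [0.85], [0.05], [0.75]]
y = [0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1]
cluster = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

residuals = null_residuals(y, cluster)
tree = grow(X, y, residuals, list(range(len(y))))
print(tree["feature"], tree["threshold"], predict(tree, [0.2]), predict(tree, [0.9]))
```

Because the tree is grown on continuous residuals rather than on the binary outcome, the split search uses the ordinary squared-error criterion, while the class label in each leaf is still read from the observed outcomes in that leaf. In practice the stand-in residuals would be replaced by residuals from a fitted null mixed model.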
Keywords
clustered data, classification, tree-based methods, residuals, kidney cancer, colectomy surgical mortality
Received 2 December 2019
Accepted 16 September 2020
Published 9 February 2021