Statistics and Its Interface

Volume 14 (2021)

Number 4

A residual-based approach for robust random forest regression

Pages: 389 – 402

DOI: https://dx.doi.org/10.4310/20-SII660

Authors

Andrew J. Sage (Department of Mathematics, Statistics, and Computer Science, Lawrence University, Appleton, Wisconsin, U.S.A.)

Ulrike Genschel (Department of Statistics, Iowa State University, Ames, Iowa, U.S.A.)

Dan Nettleton (Department of Statistics, Iowa State University, Ames, Iowa, U.S.A.)

Abstract

We introduce a novel robust approach for random forest regression that is useful when the conditional distribution of the response variable, given predictor values, is contaminated. Residual analysis is used to identify unusual response values in training data, and the contributions of these values are down-weighted accordingly. This approach is motivated by a robust fitting procedure first proposed in the context of locally weighted polynomial regression and scatterplot smoothing. We demonstrate that tuning the parameter in the robustness algorithm using a weighted cross-validation approach is advantageous when contamination is suspected in training data responses. We conduct extensive simulations, comparing our method to existing robust approaches, some of which have not been compared to one another in prior studies. Our approach outperforms existing techniques on noisy training datasets with response contamination. While no approach is uniformly optimal, ours is consistently competitive with the best existing approaches for robust random forest regression.
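The residual-based down-weighting idea described above can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes the bisquare robustness weights familiar from robust scatterplot smoothing, uses equal base weights as a stand-in for the weights a random forest would assign to training cases, and introduces the hypothetical names `bisquare_weights`, `robust_weighted_mean`, and `alpha` (the assumed tuning parameter) purely for illustration.

```python
from statistics import median


def bisquare_weights(residuals, alpha=6.0):
    """Bisquare robustness weights (assumed form, not taken from the paper).

    Residuals larger in magnitude than alpha * median(|r|) get weight 0.
    """
    scale = alpha * median(abs(r) for r in residuals)
    if scale == 0:
        # Degenerate case: at least half the residuals are exactly zero.
        return [1.0 if r == 0 else 0.0 for r in residuals]
    weights = []
    for r in residuals:
        u = r / scale
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return weights


def robust_weighted_mean(y, base_weights, alpha=6.0, n_iter=5):
    """Iteratively re-weighted mean: base weights times robustness weights."""
    w = list(base_weights)
    for _ in range(n_iter):
        pred = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
        delta = bisquare_weights([yi - pred for yi in y], alpha)
        w = [bi * di for bi, di in zip(base_weights, delta)]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


# Toy responses with one contaminated value (hypothetical data).
y = [1.0, 1.2, 0.9, 1.1, 1.0, 25.0]
base = [1 / len(y)] * len(y)  # equal weights stand in for forest weights
plain = sum(b * v for b, v in zip(base, y))
robust = robust_weighted_mean(y, base)
print(f"plain mean: {plain:.3f}, robust mean: {robust:.3f}")
```

In this toy setting the plain mean is pulled toward the contaminated response, while the re-weighted estimate stays close to the bulk of the data. Smaller values of `alpha` down-weight more aggressively, which is why such a parameter would plausibly need tuning.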

Keywords

data contamination, robustness, random forest

2010 Mathematics Subject Classification

Primary 62G35. Secondary 62G08.


Received 27 January 2020

Accepted 18 December 2020

Published 8 July 2021