Statistics and Its Interface

Volume 15 (2022)

Number 4

The more data, the better? Demystifying deletion-based methods in linear regression with missing data

Pages: 515 – 526

DOI: https://dx.doi.org/10.4310/21-SII717

Authors

Tianchen Xu (Mailman School of Public Health, Columbia University, New York, N.Y., U.S.A.)

Kun Chen (Department of Statistics, University of Connecticut, Storrs, Ct., U.S.A.)

Gen Li (School of Public Health, University of Michigan, Ann Arbor, Mich., U.S.A.)

Abstract

We compare two deletion-based methods for dealing with the problem of missing observations in linear regression analysis. One is the complete-case analysis (CC, or listwise deletion) that discards all incomplete observations and only uses common samples for ordinary least-squares estimation. The other is the available-case analysis (AC, or pairwise deletion) that utilizes all available data to estimate the covariance matrices and applies these matrices to construct the normal equation. We show that the estimates from both methods are asymptotically unbiased under missing completely at random (MCAR) and further compare their asymptotic variances in some typical situations. Surprisingly, using more data (i.e., AC) does not necessarily lead to better asymptotic efficiency in many scenarios. Missing patterns, covariance structure and true regression coefficient values all play a role in determining which is better. We further conduct simulation studies to corroborate the findings and demystify what has been missed or misinterpreted in the literature. Some detailed proofs and simulation results are available in the online supplemental materials.
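The two estimators described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' code. The function names `complete_case_ols` and `available_case_ols` are hypothetical, missing values are assumed to be coded as NaN, and the AC version shown is one common variant (each covariance entry uses the rows where both variables are observed, and the intercept uses all available values for each mean).

```python
import numpy as np

def complete_case_ols(X, y):
    """CC (listwise deletion): keep only fully observed rows, then run OLS."""
    keep = ~np.isnan(X).any(axis=1) & ~np.isnan(y)
    design = np.column_stack([np.ones(keep.sum()), X[keep]])
    beta, *_ = np.linalg.lstsq(design, y[keep], rcond=None)
    return beta  # intercept first, then slopes

def available_case_ols(X, y):
    """AC (pairwise deletion): estimate each covariance from the rows where
    both variables are observed, then solve the normal equations."""
    Z = np.column_stack([X, y])
    p, d = X.shape[1], Z.shape[1]
    S = np.empty((d, d))
    for j in range(d):
        for k in range(j, d):
            both = ~np.isnan(Z[:, j]) & ~np.isnan(Z[:, k])
            zj, zk = Z[both, j], Z[both, k]
            S[j, k] = S[k, j] = np.mean((zj - zj.mean()) * (zk - zk.mean()))
    slopes = np.linalg.solve(S[:p, :p], S[:p, p])
    means = np.nanmean(Z, axis=0)  # all available values per variable
    intercept = means[p] - means[:p] @ slopes
    return np.concatenate([[intercept], slopes])

if __name__ == "__main__":
    # Simulated data with covariate entries deleted completely at random.
    rng = np.random.default_rng(0)
    n = 5000
    X = rng.normal(size=(n, 2))
    y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=n)
    X_miss = X.copy()
    X_miss[rng.random(X.shape) < 0.2] = np.nan
    print("CC:", complete_case_ols(X_miss, y))
    print("AC:", available_case_ols(X_miss, y))
```

With no missing values the two functions return the same coefficients, since both reduce to ordinary least squares. Under MCAR both should land near the true coefficients in large samples, but which one has the smaller variance depends on the missing pattern, the covariance structure and the coefficient values, as the abstract notes.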

Keywords

asymptotic variance, available-case analysis, complete-case analysis, missing data

2010 Mathematics Subject Classification

Primary 62Dxx, 62J05. Secondary 62F12.

Gen Li’s work was partially supported by the National Institutes of Health (grant number R01HG010731).

Kun Chen’s work was partially supported by the National Science Foundation, Alexandria, Virginia (grant number IIS-1718798).

Received 26 June 2021

Accepted 14 December 2021

Published 4 March 2022