Single-gene negative binomial regression models for RNA-Seq data with higher-order asymptotic inference

We consider negative binomial (NB) regression models for RNA-Seq read counts and investigate an approach in which such models are fitted to each gene separately and, in particular, the NB dispersion parameter is estimated from each gene separately without assuming commonalities between genes. This single-gene approach contrasts with the more widely used dispersion-modeling approach, in which the NB dispersion is modeled as a simple function of the mean or of other measures of read abundance and then estimated from a large number of genes combined. We show that, through the use of higher-order asymptotic techniques, inferences with correct type I error rates can be made about the regression coefficients in a single-gene NB regression model even when the dispersion is unknown and the sample size is small. The motivations for studying single-gene models include: 1) they provide a basis of reference for understanding and quantifying the power-robustness tradeoffs of the dispersion-modeling approach; 2) they can potentially be useful in practice if moderate sample sizes become available and diagnostic tools indicate potential problems with simple models of dispersion.
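
To make the single-gene idea concrete, the following is a minimal sketch of estimating the NB dispersion from one gene alone, without borrowing information from other genes. It rests on assumptions that are not stated in the abstract: the NB is parameterized by mean mu and dispersion phi so that Var(Y) = mu + phi * mu^2, the design is a simple two-group comparison, and all libraries have equal size, so that for any fixed phi the maximum likelihood estimate of each group mean is the group sample mean. The function names, the toy counts, and the grid are hypothetical. The sketch uses ordinary profile likelihood only; it does not implement the higher-order asymptotic inference described above.

```python
import math

def nb_logpmf(y, mu, phi):
    """Log-probability of count y under an NB with mean mu > 0 and
    dispersion phi > 0, assuming Var(Y) = mu + phi * mu**2."""
    size = 1.0 / phi
    return (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
            + size * math.log(size / (size + mu))
            + y * math.log(mu / (size + mu)))

def profile_loglik(counts, groups, phi):
    """Profile log-likelihood of phi for one gene in a one-factor design.
    Assumes equal library sizes and positive group means, so each group
    mean is replaced by its sample mean (its MLE for fixed phi)."""
    totals, sizes = {}, {}
    for y, g in zip(counts, groups):
        totals[g] = totals.get(g, 0) + y
        sizes[g] = sizes.get(g, 0) + 1
    means = {g: totals[g] / sizes[g] for g in totals}
    return sum(nb_logpmf(y, means[g], phi) for y, g in zip(counts, groups))

def estimate_dispersion(counts, groups, grid):
    """Grid point that maximizes the single-gene profile log-likelihood."""
    return max(grid, key=lambda phi: profile_loglik(counts, groups, phi))

# Hypothetical read counts for a single gene, three replicates per group.
counts = [12, 30, 7, 55, 90, 41]
groups = ["A", "A", "A", "B", "B", "B"]
grid = [10 ** (k / 20.0) for k in range(-60, 21)]  # 0.001 to 10, log-spaced
phi_hat = estimate_dispersion(counts, groups, grid)
print("single-gene dispersion estimate:", phi_hat)
```

With only a few replicates per group, an estimate of this kind is noisy, which is the small-sample setting the abstract has in mind when it turns to higher-order asymptotic corrections.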

Keywords

RNA-Seq, higher-order asymptotics, negative binomial, regression, overdispersion, extra-Poisson variation, power-robustness

2010 Mathematics Subject Classification

Primary 62P10. Secondary 92D20.

Published 19 October 2015