Statistics and Its Interface

Volume 8 (2015)

Number 4

Estimation of gene co-expression from RNA-Seq count data

Pages: 507 – 515

DOI: https://dx.doi.org/10.4310/SII.2015.v8.n4.a9

Authors

Alicia T. Specht (University of Notre Dame, Indiana, U.S.A.)

Jun Li (University of Notre Dame, Indiana, U.S.A.)

Abstract

Gene coexpression networks are widely used to understand gene regulation and to infer gene function. The most straightforward way to construct a coexpression network is to connect gene pairs whose expression levels are highly correlated across different experimental conditions. This correlation is usually measured by Pearson’s correlation coefficient, which, however, does not directly apply to data generated by RNA-Seq. RNA-Seq data are non-negative integer counts that cannot be properly modeled by a Gaussian distribution. Moreover, these counts have mean values proportional to the sequencing depths, so there are no identically distributed “replicates.” Directly normalizing counts by the corresponding sequencing depths and then applying Pearson’s correlation coefficient can have low efficiency. We propose a generalization of Pearson’s correlation coefficient, called iCC, that can be applied directly to RNA-Seq data. On simulated data, iCC shows higher efficiency in distinguishing coexpressed gene pairs from unrelated gene pairs. On a real dataset, iCC generates a coexpression network that appears to agree more closely with experimentally validated networks than those generated by other methods. More generally, iCC can be used to calculate the correlation coefficient for any two series of random variables.
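For orientation, the baseline that the abstract contrasts with iCC (dividing each count by its sample's sequencing depth and then computing Pearson's correlation coefficient) might be sketched as follows. This is only an illustration of that baseline, not of iCC, whose definition is not given in the abstract. The function name `depth_normalized_pearson` and the toy counts and depths are hypothetical and chosen purely for demonstration.

```python
import math


def depth_normalized_pearson(counts_x, counts_y, depths):
    """Pearson correlation of two genes after dividing counts by sequencing depth.

    counts_x, counts_y: read counts for two genes, one entry per sample.
    depths: relative sequencing depth of each sample (assumed positive).
    """
    if not (len(counts_x) == len(counts_y) == len(depths)):
        raise ValueError("all inputs must have one entry per sample")

    # Depth normalization: put every sample on a common scale.
    x = [c / d for c, d in zip(counts_x, depths)]
    y = [c / d for c, d in zip(counts_y, depths)]

    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of cross products and squares of the centered values.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)

    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy data: two genes measured in five samples.
depths = [1.0, 2.0, 0.5, 1.5, 1.0]
gene_a = [10, 42, 4, 33, 18]
gene_b = [21, 80, 9, 62, 39]

r = depth_normalized_pearson(gene_a, gene_b, depths)
print(f"depth-normalized Pearson r = {r:.3f}")
```

The sketch treats every normalized value as equally informative, although a count from a shallow sample is noisier than one from a deep sample. That is one plausible reading of the inefficiency the abstract attributes to this approach.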

Keywords

Pearson’s correlation coefficient, RNA-Seq, coexpression network, count data, robust estimate

Published 19 October 2015