Statistical issues in binding site identification through CLIP-seq

Giovanni Stefani (Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut, U.S.A.; and Centre for Integrative Biology, University of Trento, Italy)

Frank J. Slack (Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut, U.S.A.)

Hongyu Zhao (Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, U.S.A.; and Departments of Biostatistics and Genetics, Yale School of Public Health, New Haven, Connecticut, U.S.A.)

Abstract

With the advent and development of CLIP-seq technologies, a growing number of CLIP-seq experiments are being performed to identify the targets of RNA-binding proteins and to understand the regulatory mechanisms of these proteins. Although broad similarities exist between CLIP-seq and ChIP-seq, statistical methods developed to identify binding sites from ChIP-seq data are not directly applicable to CLIP-seq data because of several differences between the two technologies. First, transcript abundance has a large impact on CLIP-seq results and must be accounted for when analyzing CLIP-seq data. Second, mutations near the binding sites in CLIP-seq data offer valuable information that can be incorporated into the analysis. Other differences arise from the ability of RNA to form complex secondary structures and from many other technical aspects of the two purification protocols. To date, no systematic studies have investigated the general statistical properties of CLIP-seq data, the merits of including RNA-seq as a matching control, or the performance of different binding site identification methods for CLIP-seq data. In this study, we performed a comprehensive evaluation of various statistical issues in using CLIP-seq data to identify RNA-protein binding sites. We demonstrate the value of RNA-seq data in background estimation and peak calling. We show that the larger dispersion of CLIP-seq data relative to ChIP-seq data is the main reason peak calling is more difficult for CLIP-seq data. Using both real and simulated data, we also show the importance of biological/technical replicates and of combining mutation and peak analysis to accurately identify binding sites from CLIP-seq data.

Published 19 October 2015