Statistics and Its Interface

Volume 3 (2010)

Number 4

Predicting kinase functional sites using hierarchical stochastic language modelling

Pages: 523 – 531

DOI: https://dx.doi.org/10.4310/SII.2010.v3.n4.a10

Authors

Minghua Deng (LMAM, School of Mathematical Sciences, Peking University, Beijing, China)

Xiangzhong Fang (LMAM, School of Mathematical Sciences, Peking University, Beijing, China)

Peng Ge (Center for Theoretical Biology, Peking University, Beijing, China)

Luhua Lai (Center for Theoretical Biology, Peking University, Beijing, China)

Guojun Pei (LMAM, School of Mathematical Sciences, Peking University, Beijing, China)

Minping Qian (LMAM, School of Mathematical Sciences, Peking University, Beijing, China)

Fengzhu Sun (Molecular and Computational Biology Program, University of Southern California at Los Angeles)

Huan Yu (LMAM, School of Mathematical Sciences, Peking University, Beijing, China)

Abstract

Motivation: Predicting functional sites in kinases is an important problem in biology. Both the functional sites and the relationships among the amino acids within those sites need to be understood. We develop an algorithm, based on hierarchical stochastic language (HSL) modelling, that predicts kinase functional sites from amino acid sequence data.

Results: Our method was validated using two complementary approaches. Firstly, the functional sites predicted by the HSL were compared with experimentally verified functional sites, including the patterns in PROSITE, the contacting sites in the Protein Data Bank (PDB), and the domains in Pfam. The overall average recall/precision of the HSL model was 83.5% / 23.0% against the patterns in PROSITE and 66.1% / 79.9% against the contacting sites in PDB. Compared to Pfam, 90% of the predicted functional sites were parts of domains with names containing the substring “kinase”. Secondly, 10-fold cross-validation was used to study the kinase function prediction accuracy of the HSL. The HSL achieved both high sensitivity (94.7%) and high specificity (94.0%), compared to 94.5% and 85.8%, respectively, for MEME. The HSL model also automatically detected kinase sub-families, and the identified sub-families were consistent with known phylogenetic trees of the kinase sequences. The HSL is therefore applicable to kinase sequences with heterogeneous subsets sharing the same catalysis function.
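
For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of the standard definitions of recall (also called sensitivity), precision, and specificity. It assumes predictions and reference annotations are compared position by position, which the abstract does not state. The names `Counts` and `site_counts` and the toy positions are hypothetical illustrations, not data or code from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion counts for one sequence (hypothetical helper type)."""
    tp: int  # predicted and in the reference
    fp: int  # predicted but not in the reference
    tn: int  # neither predicted nor in the reference
    fn: int  # in the reference but not predicted


def site_counts(predicted, reference, length):
    """Compare two sets of residue positions on a sequence of given length.

    Assumption: evaluation is per residue position; the paper's actual
    matching rule may differ.
    """
    tp = len(predicted & reference)
    fp = len(predicted - reference)
    fn = len(reference - predicted)
    tn = length - tp - fp - fn
    return Counts(tp, fp, tn, fn)


def recall(c):
    """Recall (sensitivity): fraction of reference positions recovered."""
    return c.tp / (c.tp + c.fn)


def precision(c):
    """Precision: fraction of predicted positions that are in the reference."""
    return c.tp / (c.tp + c.fp)


def specificity(c):
    """Specificity: fraction of non-reference positions left unpredicted."""
    return c.tn / (c.tn + c.fp)


# Toy example with made-up positions on a 100-residue sequence.
predicted = {10, 11, 12, 13, 40}
reference = {11, 12, 13, 14}
counts = site_counts(predicted, reference, length=100)

print(f"recall={recall(counts):.3f}")
print(f"precision={precision(counts):.3f}")
print(f"specificity={specificity(counts):.3f}")
```

Under these definitions, a high recall paired with a low precision, as in the PROSITE comparison, means that most reference positions are recovered while many predicted positions fall outside the reference annotation.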

Availability and Supplementary information: The software and supplementary materials are available at http://www.math.pku.edu.cn/teachers/dengmh/HSL

Keywords

kinase, functional sites, hierarchical stochastic language (HSL)

Published 1 January 2010