Communications in Information and Systems

Volume 20 (2020)

Number 1

Similarity analysis of protein sequences using a reduced $k$-mer amino acid model

Pages: 45 – 60

DOI: https://dx.doi.org/10.4310/CIS.2020.v20.n1.a3

Authors

Jia Wen (School of Information Engineering, Suihua University, Suihua, China)

Yuyan Zhang (School of Agriculture and Hydraulic Engineering, Suihua University, Suihua, China)

Huanxu Wang (School of Information Engineering, Suihua University, Suihua, China)

Abstract

Based on the properties of their side chains, the 20 natural amino acids are grouped into a simplified feature space, so that an original protein sequence can be represented by a reduced amino acid sequence containing only four residue types. Building on this reduced representation, we define the $k$-mer natural vector, a feature vector that characterizes the frequencies and positional information of the $k$-mers appearing in a reduced amino acid sequence, and use it for similarity analysis of protein sequences. The analysis can be performed quickly and easily, without requiring evolutionary models or human intervention. To demonstrate the utility of the new method, we apply it to real protein datasets; the results show that the approach describes the similarities among protein sequences accurately and is more computationally efficient than multiple sequence alignment. The reduced $k$-mer amino acid representation model is therefore a powerful tool for analyzing and annotating protein sequences.
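
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions because the abstract does not state them: the four-class grouping (nonpolar, polar uncharged, acidic, basic) is a common textbook partition by side-chain property, the class labels `N`, `P`, `A`, `B` are hypothetical, the per-$k$-mer features (count, mean position, normalized positional variance) follow the usual natural-vector form, and Euclidean distance is used as a stand-in dissimilarity measure.

```python
from itertools import product
from math import sqrt

# ASSUMPTION: a textbook four-class partition by side-chain property.
# The abstract does not list the classes actually used in the paper.
CLASSES = {
    "N": "GAVLIPFWM",  # nonpolar
    "P": "STCYNQ",     # polar, uncharged
    "A": "DE",         # acidic
    "B": "KRH",        # basic
}
REDUCE = {aa: label for label, members in CLASSES.items() for aa in members}


def reduce_sequence(protein):
    """Map a 20-letter protein sequence onto the four-letter reduced alphabet.

    Raises KeyError for symbols outside the 20 standard residues (e.g. 'X').
    """
    return "".join(REDUCE[aa] for aa in protein.upper())


def kmer_natural_vector(reduced, k):
    """Return counts, mean positions, and normalized positional variances
    for all 4**k k-mers, concatenated into one list of length 3 * 4**k.

    ASSUMPTION: positions are 1-based, and the variance is divided by
    (count * number of k-mer windows); the paper may normalize differently.
    """
    alphabet = sorted(CLASSES)
    positions = {"".join(p): [] for p in product(alphabet, repeat=k)}
    windows = len(reduced) - k + 1
    for i in range(windows):
        positions[reduced[i:i + k]].append(i + 1)

    counts, means, spreads = [], [], []
    for pos in positions.values():
        n = len(pos)
        mu = sum(pos) / n if n else 0.0
        d2 = sum((p - mu) ** 2 for p in pos) / (n * windows) if n else 0.0
        counts.append(n)
        means.append(mu)
        spreads.append(d2)
    return counts + means + spreads


def distance(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

For example, `reduce_sequence("MKDS")` gives `"NBAP"` under this grouping, and two proteins can be compared with `distance(kmer_natural_vector(reduce_sequence(s1), 2), kmer_natural_vector(reduce_sequence(s2), 2))`. Because every sequence maps to a vector of the same fixed length, no alignment step is needed, which is the source of the efficiency gain the abstract reports relative to multiple sequence alignment.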

Keywords

similarity analysis, protein sequence, reduced amino acid model, $k$-mer natural vector, multiple sequence alignment

This work is partially supported by Scientific Research Funding of Suihua University (K1501009, 2017-XGYYWF-017), and by Natural Scientific Research Funding of Heilongjiang (LH2019A031).

Received 5 September 2019

Published 17 April 2020