
PSSM Amino-Acid Composition Based Gene Identification Using Support Vector Machines

Heena Farooq Bhat, M. Arif Wani



Understanding the molecular mechanisms of the cell requires knowing the function of each protein encoded in the genome, and genome annotation supports that goal. One of the essential phases of genome annotation is gene prediction. Several methods have been developed to locate or predict gene patterns in a genome sequence, yet gene recognition remains a difficult problem: recognizing the gene that corresponds to a given protein sequence with conventional tools is error prone. In this paper, we first discuss the problem of gene prediction and its challenges. We then present a new method for identifying genes, which follows a two-step procedure. First, new features are extracted from protein sequences; these features are derived from a position specific scoring matrix (PSSM), and the PSSM profiles are converted into a uniform numeric representation. Second, the resulting PSSM vectors are given as input to a support vector machine (SVM) for classification. The method is demonstrated on a genome DNA dataset, and the experimental results show that the new approach produces better results.

Keywords: Gene prediction, classification, feature extraction, binding proteins, rule induction, position specific scoring matrix
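
The abstract does not say how the PSSM profiles are converted into a uniform numeric representation. One common choice, suggested by the "amino-acid composition" in the title but assumed here rather than taken from the paper, is to average each of the 20 PSSM columns over all sequence positions, so that proteins of any length map to a vector of the same size. The function name `pssm_composition` and the toy matrices are hypothetical and serve only as a minimal sketch of that idea.

```python
from typing import List, Sequence


def pssm_composition(pssm: Sequence[Sequence[float]]) -> List[float]:
    """Collapse an L x 20 PSSM into a fixed-length vector of column means.

    Assumption: column averaging is used as the "uniform numeric
    representation"; the paper may define its features differently.
    """
    if not pssm:
        raise ValueError("PSSM must contain at least one row")
    n_cols = len(pssm[0])
    if any(len(row) != n_cols for row in pssm):
        raise ValueError("all PSSM rows must have the same number of columns")
    length = len(pssm)
    # Mean score of each amino-acid column across all sequence positions.
    return [sum(row[j] for row in pssm) / length for j in range(n_cols)]


# Toy profiles (made-up scores) for two proteins of different lengths.
short_pssm = [[1] * 20, [3] * 20]                              # length 2
long_pssm = [[float(j) for j in range(20)] for _ in range(5)]  # length 5

short_vec = pssm_composition(short_pssm)
long_vec = pssm_composition(long_pssm)

# Both proteins now have 20-dimensional feature vectors.
print(len(short_vec), len(long_vec))
```

Vectors of this kind, one per protein together with a class label, could then be passed to any SVM implementation (for example the `SVC` class in scikit-learn) for the classification step described in the abstract.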

Cite this Article

Heena Farooq Bhat, Arif Wani M. PSSM Amino-Acid Composition Based Gene Identification Using Support Vector Machines. Journal of Artificial Intelligence Research & Advances. 2019; 6(1): 50–58p.







