A top-down approach to classify enzyme functional classes and sub-classes using random forest

Chetan Kumar*, Alok Choudhary

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

38 Scopus citations

Abstract

Advancements in sequencing technologies have led to an exponential rise in the number of newly discovered enzymes. Enzymes are proteins that catalyze biochemical reactions and play an important role in metabolic pathways. The function of such enzymes is commonly determined by experiments, which can be time consuming and costly. Hence, there is a need for a computational method that can distinguish enzyme sequences from non-enzyme sequences and reliably predict the function of the former. To address this problem, approaches that cluster enzymes based on their sequence and structural similarity have been presented. However, these approaches are known to fail for proteins that perform the same function but are dissimilar in sequence and structure. In this article, we present a supervised machine learning model to predict the function class and sub-class of enzymes based on a set of 73 sequence-derived features. The functional classes are as defined by the International Union of Biochemistry and Molecular Biology. Using an efficient data mining algorithm called random forest, we construct a top-down three-layer model in which the top layer classifies a query protein sequence as an enzyme or non-enzyme, the second layer predicts the main function class, and the bottom layer predicts the sub-function class. The model achieved an overall classification accuracy of 94.87% for the first level, 87.7% for the second, and 84.25% for the bottom level. Our results compare very well with existing methods, and in many cases show better performance. Using feature selection methods, we show the biological relevance of a few of the top-ranked attributes.
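
The top-down routing described in the abstract can be sketched roughly as follows. This is a minimal illustration of the control flow only, not the authors' implementation: the function name `predict_top_down`, the toy rule-based classifiers, and the choice of one sub-class classifier per main class are all assumptions made for the example. In the paper each layer is a random forest trained on 73 sequence-derived features; here simple stand-in functions take their place so the sketch runs with the standard library alone.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Features = Sequence[float]
Classifier = Callable[[Features], str]


def predict_top_down(
    features: Features,
    is_enzyme: Callable[[Features], bool],
    main_class: Classifier,
    sub_class_by_main: Dict[str, Classifier],
) -> Tuple[bool, Optional[str], Optional[str]]:
    """Route one query through three layers, stopping early for non-enzymes."""
    # Layer 1: enzyme vs. non-enzyme.
    if not is_enzyme(features):
        return (False, None, None)
    # Layer 2: main functional class.
    main = main_class(features)
    # Layer 3: sub-class, using the classifier selected by the layer-2 output
    # (one classifier per main class is an assumption of this sketch).
    sub_clf = sub_class_by_main.get(main)
    sub = sub_clf(features) if sub_clf is not None else None
    return (True, main, sub)


# Hypothetical stand-ins for trained models; the thresholds are arbitrary.
def toy_is_enzyme(f: Features) -> bool:
    return f[0] > 0.5


def toy_main_class(f: Features) -> str:
    return "EC 1" if f[1] > 0.0 else "EC 3"


toy_sub_classifiers: Dict[str, Classifier] = {
    "EC 1": lambda f: "EC 1.1",
    "EC 3": lambda f: "EC 3.4",
}

if __name__ == "__main__":
    query = [0.9, 1.0]  # toy feature vector, not real sequence features
    print(predict_top_down(query, toy_is_enzyme, toy_main_class, toy_sub_classifiers))
```

One consequence of this design is that a mistake at an upper layer cannot be corrected further down: a query rejected as a non-enzyme never reaches the class or sub-class layers.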

Original language: English (US)
Article number: 1
Journal: Eurasip Journal on Bioinformatics and Systems Biology
Volume: 2012
Issue number: 1
State: Published - 2012

Funding

This study was supported in part by NSF grants CNS-0551639, IIS-0536994, NSF HECURA CCF-0621443, and NSF SDCI OCI-0724599; by DOE SCIDAC-2: Scientific Data Management Center for Enabling Technologies (CET) grant DE-FC02-07ER25808; and by DOE grant DE-FG02-08ER25848/A000.

ASJC Scopus subject areas

  • Computational Mathematics
  • General Biochemistry, Genetics and Molecular Biology
  • Computer Science Applications
