Evaluation of gene prediction software using a genomic data set: Application to Arabidopsis thaliana sequences

Nathalie Pavy, Stephane Rombauts*, Patrice Déhais, Catherine Mathé, Davuluri V V Ramana, Philippe Leroy, Pierre Rouzé

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

76 Scopus citations

Abstract

Motivation: The annotation of the Arabidopsis thaliana genome remains a problem in terms of time and quality. To improve the annotation process, we want to choose the most appropriate tools to use inside a computer-assisted annotation platform. We therefore need evaluation of prediction programs with Arabidopsis sequences containing multiple genes.

Results: We have developed AraSet, a data set of contigs of validated genes, enabling the evaluation of multi-gene models for the Arabidopsis genome. Besides conventional metrics to evaluate gene prediction at the site and the exon levels, new measures were introduced for the prediction at the protein sequence level as well as for the evaluation of gene models. This evaluation method is of general interest and could apply to any new gene prediction software and to any eukaryotic genome. The GeneMark.hmm program appears to be the most accurate software at all three levels for the Arabidopsis genomic sequences. Gene modeling could be further improved by combination of prediction software.

Availability: The AraSet sequence set, the Perl programs and complementary results and notes are available at http://sphinx.rug.ac.be:8080/biocomp/napav/.

Contact: [email protected].

Original language: English (US)
Pages (from-to): 887-899
Number of pages: 13
Journal: Bioinformatics
Volume: 15
Issue number: 11
State: Published - Nov 1999

ASJC Scopus subject areas

  • Computational Mathematics
  • Molecular Biology
  • Biochemistry
  • Statistics and Probability
  • Computer Science Applications
  • Computational Theory and Mathematics
