Derived Distribution Points heuristic for fast pairwise statistical significance estimation

Ankit Agrawal*, Alok Choudhary, Xiaoqiu Huang

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Scopus citations

Abstract

Estimating the statistical significance of a pairwise sequence alignment is crucial in homology detection. A recent development in the field is the use of pairwise statistical significance as an alternative to database statistical significance. Although pairwise statistical significance has been shown to be potentially better than database statistical significance in terms of homology detection retrieval accuracy, it is currently very time-consuming because it involves generating an empirical score distribution by aligning one sequence of the sequence pair with N random shuffles of the other sequence. A high value of N produces (statistically and potentially biologically) accurate estimates but also consumes more time. A low value of N leads to inaccurate fitting of the score distribution, and hence to poor estimates of statistical significance. In this paper, we propose a simple heuristic, called the Derived Distribution Points (DDP) heuristic, which is designed to take into account the features of the pairwise statistical significance estimation procedure and has been shown to significantly improve the quality of pairwise statistical significance estimates (evaluated in terms of retrieval accuracy) even when using low values of N. Alternatively, it can be thought of as speeding up pairwise statistical significance estimation with high values of N, where comparable performance is achieved by actually using a much lower number of random shuffles. Experiments indicate that a speed-up factor of up to 40 compared to current implementations can be achieved without loss in retrieval accuracy.
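
The shuffle-based procedure described in the abstract can be illustrated with a minimal Python sketch. It covers only the baseline estimate (align, shuffle N times, fit a distribution, read off a P-value), not the DDP heuristic itself, whose details the abstract does not give. The scoring scheme (match, mismatch, linear gap), the choice of a Gumbel (extreme value) distribution fitted by the method of moments, and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty.

    The scoring parameters are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def shuffled_scores(seq_a, seq_b, n_shuffles, rng):
    """Align seq_a against n_shuffles random shuffles of seq_b."""
    letters = list(seq_b)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        scores.append(local_alignment_score(seq_a, "".join(letters)))
    return scores


def fit_gumbel_moments(scores):
    """Method-of-moments Gumbel fit (assumed here; other fits are possible)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    beta = math.sqrt(6.0 * var) / math.pi      # scale
    mu = mean - 0.5772156649 * beta            # location (Euler-Mascheroni constant)
    return mu, beta


def pairwise_p_value(seq_a, seq_b, n_shuffles=200, seed=0):
    """Return (observed score, P-value) from a shuffle-based empirical fit."""
    rng = random.Random(seed)
    observed = local_alignment_score(seq_a, seq_b)
    scores = shuffled_scores(seq_a, seq_b, n_shuffles, rng)
    mu, beta = fit_gumbel_moments(scores)
    if beta == 0.0:
        # Degenerate case: every shuffle produced the same score.
        return observed, (1.0 if observed <= mu else 0.0)
    z = (observed - mu) / beta
    # P(S >= observed) = 1 - exp(-exp(-z)); the exponent is clamped to avoid overflow.
    p = -math.expm1(-math.exp(min(-z, 700.0)))
    return observed, p
```

Raising `n_shuffles` makes the fit more stable at a proportional cost in alignment time, which is the trade-off the abstract describes.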

Original language: English (US)
Title of host publication: 2010 ACM International Conference on Bioinformatics and Computational Biology, ACM-BCB 2010
Pages: 312-321
Number of pages: 10
State: Published - 2010
Event: 2010 ACM International Conference on Bioinformatics and Computational Biology, ACM-BCB 2010 - Niagara Falls, NY, United States
Duration: Aug 2 2010 to Aug 4 2010

Publication series

Name: 2010 ACM International Conference on Bioinformatics and Computational Biology, ACM-BCB 2010

ASJC Scopus subject areas

  • Biomedical Engineering
  • Health Information Management
