TY - GEN
T1 - Derived Distribution Points heuristic for fast pairwise statistical significance estimation
AU - Agrawal, Ankit
AU - Choudhary, Alok
AU - Huang, Xiaoqiu
PY - 2010
Y1 - 2010
N2 - Estimation of the statistical significance of a pairwise sequence alignment is crucial in homology detection. A recent development in the field is the use of pairwise statistical significance as an alternative to database statistical significance. Although pairwise statistical significance has been shown to be potentially better than database statistical significance in terms of homology detection retrieval accuracy, it is currently very time consuming, since it involves generating an empirical score distribution by aligning one sequence of the sequence pair with N random shuffles of the other sequence. A high value of N produces (statistically and potentially biologically) accurate estimates, but also consumes more time. A low value of N leads to inaccurate fitting of the score distribution, and hence to poor estimates of statistical significance. In this paper, we propose a simple heuristic, called the Derived Distribution Points (DDP) heuristic, which is designed to take into account the features of the pairwise statistical significance estimation procedure, and which has been shown to significantly improve the quality of pairwise statistical significance estimates (evaluated in terms of retrieval accuracy) even when using low values of N. Alternatively, it can be thought of as speeding up pairwise statistical significance estimation with high values of N, where comparable performance is achieved while actually using a much smaller number of random shuffles. Experiments indicate that a speed-up factor of up to 40 compared with current implementations can be achieved without loss in retrieval accuracy.
AB - Estimation of the statistical significance of a pairwise sequence alignment is crucial in homology detection. A recent development in the field is the use of pairwise statistical significance as an alternative to database statistical significance. Although pairwise statistical significance has been shown to be potentially better than database statistical significance in terms of homology detection retrieval accuracy, it is currently very time consuming, since it involves generating an empirical score distribution by aligning one sequence of the sequence pair with N random shuffles of the other sequence. A high value of N produces (statistically and potentially biologically) accurate estimates, but also consumes more time. A low value of N leads to inaccurate fitting of the score distribution, and hence to poor estimates of statistical significance. In this paper, we propose a simple heuristic, called the Derived Distribution Points (DDP) heuristic, which is designed to take into account the features of the pairwise statistical significance estimation procedure, and which has been shown to significantly improve the quality of pairwise statistical significance estimates (evaluated in terms of retrieval accuracy) even when using low values of N. Alternatively, it can be thought of as speeding up pairwise statistical significance estimation with high values of N, where comparable performance is achieved while actually using a much smaller number of random shuffles. Experiments indicate that a speed-up factor of up to 40 compared with current implementations can be achieved without loss in retrieval accuracy.
UR - http://www.scopus.com/inward/record.url?scp=77958023962&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77958023962&partnerID=8YFLogxK
U2 - 10.1145/1854776.1854819
DO - 10.1145/1854776.1854819
M3 - Conference contribution
AN - SCOPUS:77958023962
SN - 9781450304382
T3 - 2010 ACM International Conference on Bioinformatics and Computational Biology, ACM-BCB 2010
SP - 312
EP - 321
BT - 2010 ACM International Conference on Bioinformatics and Computational Biology, ACM-BCB 2010
T2 - 2010 ACM International Conference on Bioinformatics and Computational Biology, ACM-BCB 2010
Y2 - 2 August 2010 through 4 August 2010
ER -