TY - GEN
T1 - Pairwise statistical significance versus database statistical significance for local alignment of protein sequences
AU - Agrawal, Ankit
AU - Brendel, Volker
AU - Huang, Xiaoqiu
PY - 2008/8/27
Y1 - 2008/8/27
N2 - An important aspect of pairwise sequence comparison is assessing the statistical significance of the alignment. Most currently popular alignment programs report the statistical significance of an alignment in the context of a database search. This database statistical significance depends on the database, and hence the same alignment of a pair of sequences may be assigned different statistical significance values in different databases. In this paper, we explore the use of pairwise statistical significance, which is independent of any database and can be useful when only a pair of sequences is available and we want to comment on their relatedness. We compared different methods and determined that censored maximum likelihood fitting of the score distribution right of the peak is the most accurate method for estimating pairwise statistical significance. We evaluated this method in an experiment with a subset of CATH2.3, which had previously been used by other authors as a benchmark data set for protein comparison. Comparison with the database statistical significance reported by popular programs such as SSEARCH and PSI-BLAST indicates that the results of pairwise statistical significance are comparable to, and sometimes significantly better than, those of database statistical significance (with SSEARCH). However, PSI-BLAST performs best, presumably due to its use of query-specific substitution matrices.
AB - An important aspect of pairwise sequence comparison is assessing the statistical significance of the alignment. Most currently popular alignment programs report the statistical significance of an alignment in the context of a database search. This database statistical significance depends on the database, and hence the same alignment of a pair of sequences may be assigned different statistical significance values in different databases. In this paper, we explore the use of pairwise statistical significance, which is independent of any database and can be useful when only a pair of sequences is available and we want to comment on their relatedness. We compared different methods and determined that censored maximum likelihood fitting of the score distribution right of the peak is the most accurate method for estimating pairwise statistical significance. We evaluated this method in an experiment with a subset of CATH2.3, which had previously been used by other authors as a benchmark data set for protein comparison. Comparison with the database statistical significance reported by popular programs such as SSEARCH and PSI-BLAST indicates that the results of pairwise statistical significance are comparable to, and sometimes significantly better than, those of database statistical significance (with SSEARCH). However, PSI-BLAST performs best, presumably due to its use of query-specific substitution matrices.
KW - Database statistical significance
KW - Homologs
KW - Pairwise local alignment
KW - Pairwise statistical significance
UR - http://www.scopus.com/inward/record.url?scp=49949087091&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=49949087091&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-79450-9_6
DO - 10.1007/978-3-540-79450-9_6
M3 - Conference contribution
AN - SCOPUS:49949087091
SN - 3540794492
SN - 9783540794493
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 50
EP - 61
BT - Bioinformatics Research and Applications - Fourth International Symposium, ISBRA 2008, Proceedings
T2 - 4th International Symposium on Bioinformatics Research and Applications, ISBRA 2008
Y2 - 6 May 2008 through 9 May 2008
ER -