TY - GEN
T1 - Estimating pairwise statistical significance of protein local alignments using a clustering-classification approach based on amino acid composition
AU - Agrawal, Ankit
AU - Ghosh, Arka
AU - Huang, Xiaoqiu
PY - 2008
Y1 - 2008
AB - A central question in pairwise sequence comparison is assessing the statistical significance of the alignment. The alignment score distribution is known to follow an extreme value distribution with analytically calculable parameters K and λ for ungapped alignments with one substitution matrix. But no statistical theory is currently available for the gapped case and for alignments using multiple scoring matrices, although their score distribution is known to closely follow an extreme value distribution and the corresponding parameters can be estimated by simulation. Ideal estimation would require simulation for each sequence pair, which is impractical. In this paper, we present a simple clustering-classification approach based on amino acid composition to estimate K and λ for a given sequence pair and scoring scheme, including using multiple parameter sets. The resulting set of K and λ for different cluster pairs has large variability even for the same scoring scheme, underscoring the heavy dependence of K and λ on the amino acid composition. The proposed approach in this paper is an attempt to separate the influence of amino acid composition in the estimation of statistical significance of pairwise protein alignments. Experiments and analysis of other approaches to estimate statistical parameters also indicate that the methods used in this work estimate the statistical significance with good accuracy.
KW - Classification
KW - Clustering
KW - Pairwise local alignment
KW - Statistical significance
UR - http://www.scopus.com/inward/record.url?scp=49949106287&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=49949106287&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-79450-9_7
DO - 10.1007/978-3-540-79450-9_7
M3 - Conference contribution
AN - SCOPUS:49949106287
SN - 3540794492
SN - 9783540794493
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 62
EP - 73
BT - Bioinformatics Research and Applications - Fourth International Symposium, ISBRA 2008, Proceedings
T2 - 4th International Symposium on Bioinformatics Research and Applications, ISBRA 2008
Y2 - 6 May 2008 through 9 May 2008
ER -