TY - GEN
T1 - Design of multiple sequence alignment algorithms on parallel, distributed memory supercomputers
AU - Church, Philip C.
AU - Goscinski, Andrzej
AU - Holt, Kathryn
AU - Inouye, Michael
AU - Ghoting, Amol
AU - Makarychev, Konstantin
AU - Reumann, Matthias
PY - 2011
Y1 - 2011
AB - The challenge of comparing two or more genomes that have undergone recombination and substantial amounts of segmental loss and gain has recently been addressed for small numbers of genomes. However, datasets of hundreds of genomes are now common and their sizes will only increase in the future. Multiple sequence alignment of hundreds of genomes remains an intractable problem due to quadratic increases in compute time and memory footprint. To date, most alignment algorithms are designed for commodity clusters without parallelism. Hence, we propose the design of a multiple sequence alignment algorithm on massively parallel, distributed memory supercomputers to enable research into comparative genomics on large data sets. Following the methodology of the sequential progressiveMauve algorithm, we design data structures including sequences and sorted k-mer lists on the IBM Blue Gene/P supercomputer (BG/P). Preliminary results show that we can reduce the memory footprint so that we can potentially align over 250 bacterial genomes on a single BG/P compute node. We verify our results on a dataset of E. coli, Shigella and S. pneumoniae genomes. Our implementation returns results matching those of the original algorithm but in half the time and with a quarter of the memory footprint for scaffold building. In this study, we have laid the basis for multiple sequence alignment of large-scale datasets on a massively parallel, distributed memory supercomputer, thus enabling comparison of hundreds instead of a few genome sequences within a reasonable time.
UR - http://www.scopus.com/inward/record.url?scp=84055221912&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84055221912&partnerID=8YFLogxK
U2 - 10.1109/IEMBS.2011.6090208
DO - 10.1109/IEMBS.2011.6090208
M3 - Conference contribution
C2 - 22254462
AN - SCOPUS:84055221912
SN - 9781424441211
T3 - Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS
SP - 924
EP - 927
BT - 33rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS 2011
T2 - 33rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS 2011
Y2 - 30 August 2011 through 3 September 2011
ER -