Indexing genomic sequences on the IBM Blue Gene

Amol Ghoting*, Konstantin Makarychev

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

14 Scopus citations

Abstract

With advances in sequencing technology and through aggressive sequencing efforts, DNA sequence data sets have been growing at a rapid pace. To gain from these advances, it is important to provide life science researchers with the ability to process and query large sequence data sets. For the past three decades, the suffix tree has served as a fundamental data structure in processing sequential data sets. However, tree construction times on large data sets have been excessive. While parallel suffix tree construction is an obvious solution to reduce execution times, poor locality of reference has limited parallel performance. In this paper, we show that through careful parallel algorithm design, this limitation can be removed, allowing tree construction to scale to massively parallel systems like the IBM Blue Gene. We demonstrate that the entire Human genome can be indexed on 1024 processors in under 15 minutes.

Original language: English (US)
Title of host publication: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09
State: Published - 2009
Event: Conference on High Performance Computing Networking, Storage and Analysis, SC '09 - Portland, OR, United States
Duration: Nov 14 2009 to Nov 20 2009

Publication series

Name: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09

Other

Other: Conference on High Performance Computing Networking, Storage and Analysis, SC '09
Country/Territory: United States
City: Portland, OR
Period: 11/14/09 to 11/20/09

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
