Serial and parallel methods for I/O efficient suffix tree construction

Amol Ghoting*, Konstantin Makarychev

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contribution

18 Scopus citations

Abstract

Over the past three decades, the suffix tree has served as a fundamental data structure in string processing. However, its widespread applicability has been hindered due to the fact that suffix tree construction does not scale well with the size of the input string. With advances in data collection and storage technologies, large strings have become ubiquitous, especially across emerging applications involving text, time series, and biological sequence data. To benefit from these advances, it is imperative that we realize a scalable suffix tree construction algorithm. To deal with the aforementioned challenge, the past few years have seen the emergence of several disk-based suffix tree construction algorithms. However, construction times continue to be daunting - for e.g., indexing the entire Human genome still takes over 30 hours on a system with 2 gigabytes of physical memory. In this paper, first, we empirically demonstrate and argue that all existing suffix tree construction algorithms have a severe limitation - to glean reasonable disk I/O efficiency, the input string being indexed must fit in main memory. This limitation is attributed to the poor locality properties of existing suffix tree construction algorithms and inhibits both sequential and parallel scalability. To deal with this limitation, second, we show that through careful algorithm design, one of the simplest suffix tree construction algorithms can be re-architected to build a suffix tree in a tiled fashion, allowing the implementation to maintain a constant working set size and fixed memory footprint when indexing strings of any size. Third, we show how improved locality of reference coupled with effective collective communication facilitates an efficient parallelization on massively parallel systems like the IBM Blue Gene/L. Finally, we empirically show that the proposed approach affords improvements of several orders of magnitude when indexing large strings. Furthermore, we demonstrate that the proposed parallelization is scalable and allows one to index the entire Human genome on a 1024 processor system in under 15 minutes.

Original languageEnglish (US)
Title of host publicationSIGMOD-PODS'09 - Proceedings of the International Conference on Management of Data and 28th Symposium on Principles of Database Systems
Pages827-840
Number of pages14
DOIs
StatePublished - 2009
EventInternational Conference on Management of Data and 28th Symposium on Principles of Database Systems, SIGMOD-PODS'09 - Providence, RI, United States
Duration: Jun 29 2009Jul 2 2009

Publication series

NameSIGMOD-PODS'09 - Proceedings of the International Conference on Management of Data and 28th Symposium on Principles of Database Systems

Other

OtherInternational Conference on Management of Data and 28th Symposium on Principles of Database Systems, SIGMOD-PODS'09
Country/TerritoryUnited States
CityProvidence, RI
Period6/29/097/2/09

Keywords

  • Disk-based
  • Externalmemory
  • Genome indexing
  • Parallel
  • Sequence indexing
  • Suffix tree

ASJC Scopus subject areas

  • Software

Fingerprint

Dive into the research topics of 'Serial and parallel methods for I/O efficient suffix tree construction'. Together they form a unique fingerprint.

Cite this