Abstract
Next generation sequencing technologies generate normous amount of short reads, which poses a significant computational challenge for short read alignment. Furthermore, because of sequence polymorphisms in a population, repetitive sequences, and sequencing errors, there still exist difficulties in correctly aligning all reads. We propose a space-efficient compressed suffix array-based method for short read alignment (CS2A) whose space achieves the high-order empirical entropy of the input string. Unlike BWA that uses two bits to represent a nucleotide, suitable for constant-sized alphabets, our encoding scheme can be applied to the string with any alphabet set. In addition, we present approximate pattern matching on compressed suffix array (CSA) for short read alignment. Our CS2A supports both mismatch and gapped alignments for single-end and paired-end reads mapping, being capable of efficiently aligning short sequencing reads to genome sequences. The experimental results show that CS2A can compete with the popular aligners in memory usage and mapping accuracy. The source code is available online.
Original language | English (US) |
---|---|
Title of host publication | Proceedings - DCC 2016 |
Subtitle of host publication | 2016 Data Compression Conference |
Editors | Michael W. Marcellin, Ali Bilgin, Joan Serra-Sagrista, James A. Storer |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 271-278 |
Number of pages | 8 |
ISBN (Electronic) | 9781509018536 |
DOIs | |
State | Published - Dec 15 2016 |
Event | 2016 Data Compression Conference, DCC 2016 - Snowbird, United States Duration: Mar 29 2016 → Apr 1 2016 |
Publication series
Name | Data Compression Conference Proceedings |
---|---|
ISSN (Print) | 1068-0314 |
Other
Other | 2016 Data Compression Conference, DCC 2016 |
---|---|
Country/Territory | United States |
City | Snowbird |
Period | 3/29/16 → 4/1/16 |
Funding
ACKNOWLEDGMENT The authors would like to thank Heng Li for clarifying some data structures related to BWA [11], and also wish to thank the anonymous reviewers for their careful reading and their constructive comments, which have considerably improved the quality of the paper. This work is supported in part by China NSF grants 61173025 and 61373044 (H. Huo), and China NSF grant 61502366 (Q. Yu), by US NSF grant CCF-1017623 (J.S. Vitter), and US NIH grants P20 GM103638, P30 AG035982, and P30 HD002528 (X. Wang).
ASJC Scopus subject areas
- Computer Networks and Communications