CS2A: A Compressed Suffix Array-Based Method for Short Read Alignment

Hongwei Huo, Zhigang Sun, Shuangjiang Li, Jeffrey Scott Vitter, Xinkun Wang, Qiang Yu, Jun Huan

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

5 Scopus citations

Abstract

Next-generation sequencing technologies generate an enormous number of short reads, which poses a significant computational challenge for short read alignment. Furthermore, because of sequence polymorphisms in a population, repetitive sequences, and sequencing errors, it remains difficult to align all reads correctly. We propose a space-efficient compressed suffix array-based method for short read alignment (CS2A) whose space usage achieves the high-order empirical entropy of the input string. Unlike BWA, which uses two bits to represent a nucleotide and is therefore suited to constant-sized alphabets, our encoding scheme can be applied to strings over any alphabet. In addition, we present approximate pattern matching on the compressed suffix array (CSA) for short read alignment. CS2A supports both mismatch and gapped alignments for single-end and paired-end read mapping, and can efficiently align short sequencing reads to genome sequences. The experimental results show that CS2A can compete with popular aligners in memory usage and mapping accuracy. The source code is available online.
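To make the idea of approximate pattern matching over a suffix-array index concrete, here is a minimal sketch in Python. It is not the CS2A implementation: it uses a plain, uncompressed BWT-style backward search built from a naively sorted suffix array, handles mismatches only (no gaps), and has none of the pruning a real aligner needs. The names `build_index` and `search` are hypothetical and chosen for illustration.

```python
from collections import Counter


def build_index(text):
    """Build an uncompressed BWT-style index (illustrative only, not CS2A's CSA)."""
    text += "$"  # sentinel, assumed absent from the input and smaller than all symbols
    # Naive suffix array construction; fine for a toy example, far too slow for genomes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = [text[i - 1] for i in sa]  # text[-1] is '$' when i == 0
    alphabet = sorted(set(text))
    # C[c] = number of characters in text strictly smaller than c
    counts = Counter(text)
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += counts[c]
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, C, occ, alphabet


def search(index, pattern, max_mismatches):
    """Return sorted start positions where pattern occurs with <= max_mismatches substitutions."""
    sa, C, occ, alphabet = index
    hits = set()

    def extend(i, lo, hi, budget):
        # [lo, hi) is the suffix array interval matching pattern[i+1:] (with substitutions).
        if i < 0:
            hits.update(sa[lo:hi])
            return
        for c in alphabet:
            if c == "$":
                continue  # never align a read character to the sentinel
            cost = 0 if c == pattern[i] else 1
            if cost > budget:
                continue
            new_lo = C[c] + occ[c][lo]
            new_hi = C[c] + occ[c][hi]
            if new_lo < new_hi:  # interval is non-empty, keep extending leftwards
                extend(i - 1, new_lo, new_hi, budget - cost)

    extend(len(pattern) - 1, 0, len(sa), max_mismatches)
    return sorted(hits)


index = build_index("ACGTACGT")
print(search(index, "ACG", 0))  # [0, 4]
print(search(index, "ACC", 0))  # []
print(search(index, "ACC", 1))  # [0, 4]
```

Because the index is keyed by whatever symbols appear in the text, the same sketch works for alphabets other than the four nucleotides. A compressed index would replace the explicit `sa` and `occ` tables with succinct structures.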

Original language: English (US)
Title of host publication: Proceedings - DCC 2016
Subtitle of host publication: 2016 Data Compression Conference
Editors: Michael W. Marcellin, Ali Bilgin, Joan Serra-Sagrista, James A. Storer
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 271-278
Number of pages: 8
ISBN (Electronic): 9781509018536
DOIs
State: Published - Dec 15 2016
Event: 2016 Data Compression Conference, DCC 2016 - Snowbird, United States
Duration: Mar 29 2016 to Apr 1 2016

Publication series

Name: Data Compression Conference Proceedings
ISSN (Print): 1068-0314

Other

Other: 2016 Data Compression Conference, DCC 2016
Country: United States
City: Snowbird
Period: 3/29/16 to 4/1/16

ASJC Scopus subject areas

  • Computer Networks and Communications
