TY - JOUR
T1 - Estimation of alternative splicing isoform frequencies from RNA-Seq data
AU - Nicolae, Marius
AU - Mangul, Serghei
AU - Mǎndoiu, Ion I.
AU - Zelikovsky, Alex
N1 - Funding Information: MN and IIM were supported in part by NSF awards IIS-0546457, IIS-0916948, and DBI-0543365 and NIFA award 2011-67016-30331. SM and AZ were supported in part by NSF award IIS-0916401 and NIFA award 2011-67016-30331. All authors would like to thank the anonymous referees for many constructive comments that helped improve the presentation.
PY - 2011/4/19
Y1 - 2011/4/19
AB - Background: Massively parallel whole transcriptome sequencing, commonly referred to as RNA-Seq, is quickly becoming the technology of choice for gene expression profiling. However, due to the short read length delivered by current sequencing technologies, estimation of expression levels for alternative splicing gene isoforms remains challenging. Results: In this paper we present a novel expectation-maximization algorithm for inference of isoform- and gene-specific expression levels from RNA-Seq data. Our algorithm, referred to as IsoEM, is based on disambiguating information provided by the distribution of insert sizes generated during sequencing library preparation, and takes advantage of base quality scores, strand and read pairing information when available. The open source Java implementation of IsoEM is freely available at http://dna.engr.uconn.edu/software/IsoEM/. Conclusions: Empirical experiments on both synthetic and real RNA-Seq datasets show that IsoEM has scalable running time and outperforms existing methods of isoform and gene expression level estimation. Simulation experiments confirm previous findings that, for a fixed sequencing cost, using reads longer than 25-36 bases does not necessarily lead to better accuracy for estimating expression levels of annotated isoforms and genes.
UR - http://www.scopus.com/inward/record.url?scp=79955082292&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79955082292&partnerID=8YFLogxK
U2 - 10.1186/1748-7188-6-9
DO - 10.1186/1748-7188-6-9
M3 - Article
C2 - 21504602
AN - SCOPUS:79955082292
VL - 6
JO - Algorithms for Molecular Biology
JF - Algorithms for Molecular Biology
SN - 1748-7188
IS - 1
M1 - 9
ER -