Maximum likelihood estimation of incomplete genomic spectrum from HTS data

Serghei Mangul*, Irina Astrovskaya, Marius Nicolae, Bassam Tork, Ion Mandoiu, Alex Zelikovsky

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Scopus citations


High-throughput sequencing makes it possible to process samples containing multiple genomic sequences and then estimate their frequencies or even assemble them. When the sequences present in the sample are known, maximum likelihood estimation of their frequencies from the observed reads can be performed efficiently with the expectation-maximization (EM) method. Frequently, such knowledge is incomplete: in RNA-Seq, not all isoforms are known, and in viral quasispecies sequencing the sequences themselves are unknown. We propose to enhance EM with a virtual string and incorporate it into frequency estimation tools for RNA-Seq and quasispecies sequencing. Our simulations show that EM enhanced with the virtual string estimates string frequencies more accurately than the original methods and that it can find the reads from missing quasispecies, thus enabling their reconstruction.
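The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `em_frequencies` and `add_virtual_string`, the `weights[s][r]` compatibility matrix, and in particular the constant emission weight given to the virtual string are all assumptions made for illustration. The sketch runs plain EM over a set of known strings, then repeats it with one extra "virtual" string that can weakly emit every read, so that reads no known string explains have somewhere to go.

```python
def em_frequencies(weights, n_iter=200):
    """Estimate string frequencies by EM.

    weights[s][r] is an assumed compatibility weight: how likely it is
    that string s emitted read r (0.0 if the read cannot come from s).
    """
    n_strings = len(weights)
    n_reads = len(weights[0])
    freqs = [1.0 / n_strings] * n_strings  # uniform starting point
    for _ in range(n_iter):
        counts = [0.0] * n_strings
        for r in range(n_reads):
            # E-step: split read r among strings in proportion to f_s * w_sr.
            total = sum(freqs[s] * weights[s][r] for s in range(n_strings))
            if total == 0.0:
                continue  # no string explains this read; it is skipped
            for s in range(n_strings):
                counts[s] += freqs[s] * weights[s][r] / total
        assigned = sum(counts)
        if assigned == 0.0:
            break
        # M-step: new frequency is the share of (fractional) reads assigned.
        freqs = [c / assigned for c in counts]
    return freqs


def add_virtual_string(weights, virtual_weight=0.01):
    """Append a hypothetical virtual string that can emit every read.

    A constant small weight is a simplifying assumption of this sketch,
    not the weighting scheme used in the paper.
    """
    n_reads = len(weights[0])
    return [list(row) for row in weights] + [[virtual_weight] * n_reads]


# Two known strings A and B, four reads; the last read matches neither.
known = [
    [1.0, 1.0, 0.0, 0.0],  # string A explains reads 0 and 1
    [0.0, 0.0, 1.0, 0.0],  # string B explains read 2
]
plain = em_frequencies(known)
with_virtual = em_frequencies(add_virtual_string(known))
```

In this toy input, plain EM silently ignores the unexplained read, while the run with the virtual string assigns it a nonzero share. Reads that end up attributed mostly to the virtual string are the natural candidates for reconstructing the missing sequences.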

Original language: English (US)
Title of host publication: Algorithms in Bioinformatics - 11th International Workshop, WABI 2011, Proceedings
Number of pages: 12
State: Published - Sep 26 2011
Event: 11th Workshop on Algorithms in Bioinformatics, WABI 2011 - Saarbrucken, Germany
Duration: Sep 5 2011 - Sep 7 2011

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 6833 LNBI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: 11th Workshop on Algorithms in Bioinformatics, WABI 2011


  • expectation maximization
  • high-throughput sequencing
  • RNA-Sequencing
  • viral quasispecies

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science
