Discovering almost any hidden motif from multiple sequences in polynomial time with low sample complexity and high success probability

Bin Fu*, Ming-Yang Kao, Lusheng Wang

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contribution

1 Scopus citations

Abstract

We study a natural probabilistic model for motif discovery that has been used to experimentally test the effectiveness of motif discovery programs. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet ∑ A motif G = g 1g2 ⋯ gm is a string of m characters. Each background sequence is implanted a probabilistically generated approximate copy of G. For a probabilistically generated approximate copy b1b 2 ⋯ bm of G, every character is probabilistically generated such that the probability for bi ≠ gi is at most a. It has been conjectured that multiple background sequences can help with finding faint motifs G. In this paper, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet ∑ with |∑| ≥ 2 and is applicable to DNA motif discovery. We prove that for a ≤ 1/4 (1 - 1|∑|) and any constant x ≥ 8, there exist positive constants c0, ε, δ1 and δ2 such that if the length ρ of motif G is at least δ1 log n, and there are k ≥ c0 log n input sequences, then in O(n 2+kn) time this algorithm finds the motif with probability at least 1- 1/2x for every G ε ∑ρρ,h,ε(∑), where ρ is the length of the motif, h is a parameter with ρ ≥ 4h ≥ δ2 log n, and ψρ,h,ε(∑) is a small subset of at most 2- θ(ε2h) fraction of the sequences in ∑ρ. The constants c0, ε, δ1 and d2 do not depend on x when x is a parameter of order O(log n). Our algorithm can take any number k sequences as input.

Original languageEnglish (US)
Title of host publicationTheory and Applications of Models of Computation - 6th Annual Conference, TAMC 2009, Proceedings
Pages231-240
Number of pages10
DOIs
StatePublished - Jul 16 2009
Event6th Annual Conference on Theory and Applications of Models of Computation, TAMC 2009 - Changsha, China
Duration: May 18 2009May 22 2009

Publication series

NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume5532 LNCS
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349

Other

Other6th Annual Conference on Theory and Applications of Models of Computation, TAMC 2009
CountryChina
CityChangsha
Period5/18/095/22/09

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science(all)

Fingerprint Dive into the research topics of 'Discovering almost any hidden motif from multiple sequences in polynomial time with low sample complexity and high success probability'. Together they form a unique fingerprint.

Cite this