Discovering almost any hidden motif from multiple sequences in polynomial time with low sample complexity and high success probability

Bin Fu*, Ming-Yang Kao, Lusheng Wang

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contribution

1 Scopus citations

Abstract

We study a natural probabilistic model for motif discovery that has been used to experimentally test the effectiveness of motif discovery programs. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet ∑ A motif G = g 1g2 ⋯ gm is a string of m characters. Each background sequence is implanted a probabilistically generated approximate copy of G. For a probabilistically generated approximate copy b1b 2 ⋯ bm of G, every character is probabilistically generated such that the probability for bi ≠ gi is at most a. It has been conjectured that multiple background sequences can help with finding faint motifs G. In this paper, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet ∑ with |∑| ≥ 2 and is applicable to DNA motif discovery. We prove that for a ≤ 1/4 (1 - 1|∑|) and any constant x ≥ 8, there exist positive constants c0, ε, δ1 and δ2 such that if the length ρ of motif G is at least δ1 log n, and there are k ≥ c0 log n input sequences, then in O(n 2+kn) time this algorithm finds the motif with probability at least 1- 1/2x for every G ε ∑ρρ,h,ε(∑), where ρ is the length of the motif, h is a parameter with ρ ≥ 4h ≥ δ2 log n, and ψρ,h,ε(∑) is a small subset of at most 2- θ(ε2h) fraction of the sequences in ∑ρ. The constants c0, ε, δ1 and d2 do not depend on x when x is a parameter of order O(log n). Our algorithm can take any number k sequences as input.

Original languageEnglish (US)
Title of host publicationTheory and Applications of Models of Computation - 6th Annual Conference, TAMC 2009, Proceedings
Pages231-240
Number of pages10
DOIs
StatePublished - Jul 16 2009
Event6th Annual Conference on Theory and Applications of Models of Computation, TAMC 2009 - Changsha, China
Duration: May 18 2009May 22 2009

Publication series

NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume5532 LNCS
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349

Other

Other6th Annual Conference on Theory and Applications of Models of Computation, TAMC 2009
CountryChina
CityChangsha
Period5/18/095/22/09

    Fingerprint

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science(all)

Cite this

Fu, B., Kao, M-Y., & Wang, L. (2009). Discovering almost any hidden motif from multiple sequences in polynomial time with low sample complexity and high success probability. In Theory and Applications of Models of Computation - 6th Annual Conference, TAMC 2009, Proceedings (pp. 231-240). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 5532 LNCS). https://doi.org/10.1007/978-3-642-02017-9_26