Discovering almost any hidden motif from multiple sequences in polynomial time with low sample complexity and high success probability

Bin Fu*, Ming-Yang Kao, Lusheng Wang

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contribution

1 Citation (Scopus)

Abstract

We study a natural probabilistic model for motif discovery that has been used to experimentally test the effectiveness of motif discovery programs. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet ∑ A motif G = g 1g2 ⋯ gm is a string of m characters. Each background sequence is implanted a probabilistically generated approximate copy of G. For a probabilistically generated approximate copy b1b 2 ⋯ bm of G, every character is probabilistically generated such that the probability for bi ≠ gi is at most a. It has been conjectured that multiple background sequences can help with finding faint motifs G. In this paper, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet ∑ with |∑| ≥ 2 and is applicable to DNA motif discovery. We prove that for a ≤ 1/4 (1 - 1|∑|) and any constant x ≥ 8, there exist positive constants c0, ε, δ1 and δ2 such that if the length ρ of motif G is at least δ1 log n, and there are k ≥ c0 log n input sequences, then in O(n 2+kn) time this algorithm finds the motif with probability at least 1- 1/2x for every G ε ∑ρρ,h,ε(∑), where ρ is the length of the motif, h is a parameter with ρ ≥ 4h ≥ δ2 log n, and ψρ,h,ε(∑) is a small subset of at most 2- θ(ε2h) fraction of the sequences in ∑ρ. The constants c0, ε, δ1 and d2 do not depend on x when x is a parameter of order O(log n). Our algorithm can take any number k sequences as input.

Original languageEnglish (US)
Title of host publicationTheory and Applications of Models of Computation - 6th Annual Conference, TAMC 2009, Proceedings
Pages231-240
Number of pages10
DOIs
StatePublished - Jul 16 2009
Event6th Annual Conference on Theory and Applications of Models of Computation, TAMC 2009 - Changsha, China
Duration: May 18 2009May 22 2009

Publication series

NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume5532 LNCS
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349

Other

Other6th Annual Conference on Theory and Applications of Models of Computation, TAMC 2009
CountryChina
CityChangsha
Period5/18/095/22/09

Fingerprint

Polynomial time
Polynomials
Motif Discovery
DNA
Probabilistic Model
Efficient Algorithms
Strings
Subset
Character
Background
Statistical Models

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science(all)

Cite this

Fu, B., Kao, M-Y., & Wang, L. (2009). Discovering almost any hidden motif from multiple sequences in polynomial time with low sample complexity and high success probability. In Theory and Applications of Models of Computation - 6th Annual Conference, TAMC 2009, Proceedings (pp. 231-240). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 5532 LNCS). https://doi.org/10.1007/978-3-642-02017-9_26
Fu, Bin ; Kao, Ming-Yang ; Wang, Lusheng. / Discovering almost any hidden motif from multiple sequences in polynomial time with low sample complexity and high success probability. Theory and Applications of Models of Computation - 6th Annual Conference, TAMC 2009, Proceedings. 2009. pp. 231-240 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{0a6589507619451995dd7b5697b0ac82,
title = "Discovering almost any hidden motif from multiple sequences in polynomial time with low sample complexity and high success probability",
abstract = "We study a natural probabilistic model for motif discovery that has been used to experimentally test the effectiveness of motif discovery programs. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet ∑ A motif G = g 1g2 ⋯ gm is a string of m characters. Each background sequence is implanted a probabilistically generated approximate copy of G. For a probabilistically generated approximate copy b1b 2 ⋯ bm of G, every character is probabilistically generated such that the probability for bi ≠ gi is at most a. It has been conjectured that multiple background sequences can help with finding faint motifs G. In this paper, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet ∑ with |∑| ≥ 2 and is applicable to DNA motif discovery. We prove that for a ≤ 1/4 (1 - 1|∑|) and any constant x ≥ 8, there exist positive constants c0, ε, δ1 and δ2 such that if the length ρ of motif G is at least δ1 log n, and there are k ≥ c0 log n input sequences, then in O(n 2+kn) time this algorithm finds the motif with probability at least 1- 1/2x for every G ε ∑ρ-ψ ρ,h,ε(∑), where ρ is the length of the motif, h is a parameter with ρ ≥ 4h ≥ δ2 log n, and ψρ,h,ε(∑) is a small subset of at most 2- θ(ε2h) fraction of the sequences in ∑ρ. The constants c0, ε, δ1 and d2 do not depend on x when x is a parameter of order O(log n). Our algorithm can take any number k sequences as input.",
author = "Bin Fu and Ming-Yang Kao and Lusheng Wang",
year = "2009",
month = "7",
day = "16",
doi = "10.1007/978-3-642-02017-9_26",
language = "English (US)",
isbn = "9783642020162",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "231--240",
booktitle = "Theory and Applications of Models of Computation - 6th Annual Conference, TAMC 2009, Proceedings",

}

Fu, B, Kao, M-Y & Wang, L 2009, Discovering almost any hidden motif from multiple sequences in polynomial time with low sample complexity and high success probability. in Theory and Applications of Models of Computation - 6th Annual Conference, TAMC 2009, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 5532 LNCS, pp. 231-240, 6th Annual Conference on Theory and Applications of Models of Computation, TAMC 2009, Changsha, China, 5/18/09. https://doi.org/10.1007/978-3-642-02017-9_26

Discovering almost any hidden motif from multiple sequences in polynomial time with low sample complexity and high success probability. / Fu, Bin; Kao, Ming-Yang; Wang, Lusheng.

Theory and Applications of Models of Computation - 6th Annual Conference, TAMC 2009, Proceedings. 2009. p. 231-240 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 5532 LNCS).

Research output: Chapter in Book/Report/Conference proceedingConference contribution

TY - GEN

T1 - Discovering almost any hidden motif from multiple sequences in polynomial time with low sample complexity and high success probability

AU - Fu, Bin

AU - Kao, Ming-Yang

AU - Wang, Lusheng

PY - 2009/7/16

Y1 - 2009/7/16

N2 - We study a natural probabilistic model for motif discovery that has been used to experimentally test the effectiveness of motif discovery programs. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet ∑ A motif G = g 1g2 ⋯ gm is a string of m characters. Each background sequence is implanted a probabilistically generated approximate copy of G. For a probabilistically generated approximate copy b1b 2 ⋯ bm of G, every character is probabilistically generated such that the probability for bi ≠ gi is at most a. It has been conjectured that multiple background sequences can help with finding faint motifs G. In this paper, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet ∑ with |∑| ≥ 2 and is applicable to DNA motif discovery. We prove that for a ≤ 1/4 (1 - 1|∑|) and any constant x ≥ 8, there exist positive constants c0, ε, δ1 and δ2 such that if the length ρ of motif G is at least δ1 log n, and there are k ≥ c0 log n input sequences, then in O(n 2+kn) time this algorithm finds the motif with probability at least 1- 1/2x for every G ε ∑ρ-ψ ρ,h,ε(∑), where ρ is the length of the motif, h is a parameter with ρ ≥ 4h ≥ δ2 log n, and ψρ,h,ε(∑) is a small subset of at most 2- θ(ε2h) fraction of the sequences in ∑ρ. The constants c0, ε, δ1 and d2 do not depend on x when x is a parameter of order O(log n). Our algorithm can take any number k sequences as input.

AB - We study a natural probabilistic model for motif discovery that has been used to experimentally test the effectiveness of motif discovery programs. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet ∑ A motif G = g 1g2 ⋯ gm is a string of m characters. Each background sequence is implanted a probabilistically generated approximate copy of G. For a probabilistically generated approximate copy b1b 2 ⋯ bm of G, every character is probabilistically generated such that the probability for bi ≠ gi is at most a. It has been conjectured that multiple background sequences can help with finding faint motifs G. In this paper, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet ∑ with |∑| ≥ 2 and is applicable to DNA motif discovery. We prove that for a ≤ 1/4 (1 - 1|∑|) and any constant x ≥ 8, there exist positive constants c0, ε, δ1 and δ2 such that if the length ρ of motif G is at least δ1 log n, and there are k ≥ c0 log n input sequences, then in O(n 2+kn) time this algorithm finds the motif with probability at least 1- 1/2x for every G ε ∑ρ-ψ ρ,h,ε(∑), where ρ is the length of the motif, h is a parameter with ρ ≥ 4h ≥ δ2 log n, and ψρ,h,ε(∑) is a small subset of at most 2- θ(ε2h) fraction of the sequences in ∑ρ. The constants c0, ε, δ1 and d2 do not depend on x when x is a parameter of order O(log n). Our algorithm can take any number k sequences as input.

UR - http://www.scopus.com/inward/record.url?scp=67650233245&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=67650233245&partnerID=8YFLogxK

U2 - 10.1007/978-3-642-02017-9_26

DO - 10.1007/978-3-642-02017-9_26

M3 - Conference contribution

AN - SCOPUS:67650233245

SN - 9783642020162

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 231

EP - 240

BT - Theory and Applications of Models of Computation - 6th Annual Conference, TAMC 2009, Proceedings

ER -

Fu B, Kao M-Y, Wang L. Discovering almost any hidden motif from multiple sequences in polynomial time with low sample complexity and high success probability. In Theory and Applications of Models of Computation - 6th Annual Conference, TAMC 2009, Proceedings. 2009. p. 231-240. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). https://doi.org/10.1007/978-3-642-02017-9_26