TY - GEN

T1 - Discovering almost any hidden motif from multiple sequences in polynomial time with low sample complexity and high success probability

AU - Fu, Bin

AU - Kao, Ming-Yang

AU - Wang, Lusheng

PY - 2009/7/16

Y1 - 2009/7/16

N2 - We study a natural probabilistic model for motif discovery that has been used to experimentally test the effectiveness of motif discovery programs. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet ∑ A motif G = g 1g2 ⋯ gm is a string of m characters. Each background sequence is implanted a probabilistically generated approximate copy of G. For a probabilistically generated approximate copy b1b 2 ⋯ bm of G, every character is probabilistically generated such that the probability for bi ≠ gi is at most a. It has been conjectured that multiple background sequences can help with finding faint motifs G. In this paper, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet ∑ with |∑| ≥ 2 and is applicable to DNA motif discovery. We prove that for a ≤ 1/4 (1 - 1|∑|) and any constant x ≥ 8, there exist positive constants c0, ε, δ1 and δ2 such that if the length ρ of motif G is at least δ1 log n, and there are k ≥ c0 log n input sequences, then in O(n 2+kn) time this algorithm finds the motif with probability at least 1- 1/2x for every G ε ∑ρ-ψ ρ,h,ε(∑), where ρ is the length of the motif, h is a parameter with ρ ≥ 4h ≥ δ2 log n, and ψρ,h,ε(∑) is a small subset of at most 2- θ(ε2h) fraction of the sequences in ∑ρ. The constants c0, ε, δ1 and d2 do not depend on x when x is a parameter of order O(log n). Our algorithm can take any number k sequences as input.

AB - We study a natural probabilistic model for motif discovery that has been used to experimentally test the effectiveness of motif discovery programs. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet ∑ A motif G = g 1g2 ⋯ gm is a string of m characters. Each background sequence is implanted a probabilistically generated approximate copy of G. For a probabilistically generated approximate copy b1b 2 ⋯ bm of G, every character is probabilistically generated such that the probability for bi ≠ gi is at most a. It has been conjectured that multiple background sequences can help with finding faint motifs G. In this paper, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet ∑ with |∑| ≥ 2 and is applicable to DNA motif discovery. We prove that for a ≤ 1/4 (1 - 1|∑|) and any constant x ≥ 8, there exist positive constants c0, ε, δ1 and δ2 such that if the length ρ of motif G is at least δ1 log n, and there are k ≥ c0 log n input sequences, then in O(n 2+kn) time this algorithm finds the motif with probability at least 1- 1/2x for every G ε ∑ρ-ψ ρ,h,ε(∑), where ρ is the length of the motif, h is a parameter with ρ ≥ 4h ≥ δ2 log n, and ψρ,h,ε(∑) is a small subset of at most 2- θ(ε2h) fraction of the sequences in ∑ρ. The constants c0, ε, δ1 and d2 do not depend on x when x is a parameter of order O(log n). Our algorithm can take any number k sequences as input.

UR - http://www.scopus.com/inward/record.url?scp=67650233245&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=67650233245&partnerID=8YFLogxK

U2 - 10.1007/978-3-642-02017-9_26

DO - 10.1007/978-3-642-02017-9_26

M3 - Conference contribution

AN - SCOPUS:67650233245

SN - 9783642020162

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 231

EP - 240

BT - Theory and Applications of Models of Computation - 6th Annual Conference, TAMC 2009, Proceedings

T2 - 6th Annual Conference on Theory and Applications of Models of Computation, TAMC 2009

Y2 - 18 May 2009 through 22 May 2009

ER -