TY - JOUR
T1 - Discovering almost any hidden motif from multiple sequences
AU - Fu, Bin
AU - Kao, Ming-Yang
AU - Wang, Lusheng
PY - 2011/3/1
Y1 - 2011/3/1
N2 - We study a natural probabilistic model for motif discovery. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet ∑. A motif G = g1g 2 ⋯ gm is a string of m characters. Each background sequence is implanted with a probabilistically generated approximate copy of G. For a probabilistically generated approximate copy b1b2 ⋯ bm of G, every character is probabilistically generated such that the probability for bi ≠ gi is at most α. In this article, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet ∑ with |∑| ≥ 2 and is applicable to DNA motif discovery. We prove that for α < 1/8 (1 - 1/|∑| ), there exist positive constants c0, ε, and δ2 such that if there are at least c0 logn input sequences, then in O( n2/h (log n)O(1)) time this algorithm finds the motif with probability at least 3/4 for every G ε ∑p - ψp,h,ε (∑), where n the length of longest sequences, p is the length of the motif, h is a parameter with p ≥ 4h ≥ δ2 log n, and εp,h,ε (∑) is a small subset of at most 2-Θ(ε2h) fraction of the sequences in ∑p .
AB - We study a natural probabilistic model for motif discovery. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet ∑. A motif G = g1g 2 ⋯ gm is a string of m characters. Each background sequence is implanted with a probabilistically generated approximate copy of G. For a probabilistically generated approximate copy b1b2 ⋯ bm of G, every character is probabilistically generated such that the probability for bi ≠ gi is at most α. In this article, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet ∑ with |∑| ≥ 2 and is applicable to DNA motif discovery. We prove that for α < 1/8 (1 - 1/|∑| ), there exist positive constants c0, ε, and δ2 such that if there are at least c0 logn input sequences, then in O( n2/h (log n)O(1)) time this algorithm finds the motif with probability at least 3/4 for every G ε ∑p - ψp,h,ε (∑), where n the length of longest sequences, p is the length of the motif, h is a parameter with p ≥ 4h ≥ δ2 log n, and εp,h,ε (∑) is a small subset of at most 2-Θ(ε2h) fraction of the sequences in ∑p .
KW - Complexity
KW - Motif detection
KW - Probabilistic analysis
KW - Probability
UR - http://www.scopus.com/inward/record.url?scp=79953237098&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79953237098&partnerID=8YFLogxK
U2 - 10.1145/1921659.1921672
DO - 10.1145/1921659.1921672
M3 - Article
AN - SCOPUS:79953237098
SN - 1549-6325
VL - 7
JO - ACM Transactions on Algorithms
JF - ACM Transactions on Algorithms
IS - 2
M1 - 26
ER -