### Abstract

We study a natural probabilistic model for motif discovery that has been used to experimentally test the effectiveness of motif discovery programs. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet ∑ A motif G = g _{1}g_{2} ⋯ gm is a string of m characters. Each background sequence is implanted a probabilistically generated approximate copy of G. For a probabilistically generated approximate copy b_{1}b _{2} ⋯ b_{m} of G, every character is probabilistically generated such that the probability for b_{i} ≠ g_{i} is at most a. It has been conjectured that multiple background sequences can help with finding faint motifs G. In this paper, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet ∑ with |∑| ≥ 2 and is applicable to DNA motif discovery. We prove that for a ≤ 1/4 (1 - 1|∑|) and any constant x ≥ 8, there exist positive constants c_{0}, ε, δ_{1} and δ_{2} such that if the length ρ of motif G is at least δ_{1} log n, and there are k ≥ c_{0} log n input sequences, then in O(n ^{2}+kn) time this algorithm finds the motif with probability at least 1- 1/2_{x} for every G ε ∑^{ρ}-ψ _{ρ,h,ε}(∑), where ρ is the length of the motif, h is a parameter with ρ ≥ 4h ≥ δ_{2} log n, and ψ_{ρ,h,ε}(∑) is a small subset of at most 2- ^{θ(ε2h}) fraction of the sequences in ∑^{ρ}. The constants c_{0}, ε, δ_{1} and d_{2} do not depend on x when x is a parameter of order O(log n). Our algorithm can take any number k sequences as input.

Original language | English (US) |
---|---|

Title of host publication | Theory and Applications of Models of Computation - 6th Annual Conference, TAMC 2009, Proceedings |

Pages | 231-240 |

Number of pages | 10 |

DOIs | |

State | Published - Jul 16 2009 |

Event | 6th Annual Conference on Theory and Applications of Models of Computation, TAMC 2009 - Changsha, China Duration: May 18 2009 → May 22 2009 |

### Publication series

Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|

Volume | 5532 LNCS |

ISSN (Print) | 0302-9743 |

ISSN (Electronic) | 1611-3349 |

### Other

Other | 6th Annual Conference on Theory and Applications of Models of Computation, TAMC 2009 |
---|---|

Country | China |

City | Changsha |

Period | 5/18/09 → 5/22/09 |

### Fingerprint

### ASJC Scopus subject areas

- Theoretical Computer Science
- Computer Science(all)

### Cite this

*Theory and Applications of Models of Computation - 6th Annual Conference, TAMC 2009, Proceedings*(pp. 231-240). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 5532 LNCS). https://doi.org/10.1007/978-3-642-02017-9_26

}

*Theory and Applications of Models of Computation - 6th Annual Conference, TAMC 2009, Proceedings.*Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 5532 LNCS, pp. 231-240, 6th Annual Conference on Theory and Applications of Models of Computation, TAMC 2009, Changsha, China, 5/18/09. https://doi.org/10.1007/978-3-642-02017-9_26

**Discovering almost any hidden motif from multiple sequences in polynomial time with low sample complexity and high success probability.** / Fu, Bin; Kao, Ming-Yang; Wang, Lusheng.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

TY - GEN

T1 - Discovering almost any hidden motif from multiple sequences in polynomial time with low sample complexity and high success probability

AU - Fu, Bin

AU - Kao, Ming-Yang

AU - Wang, Lusheng

PY - 2009/7/16

Y1 - 2009/7/16

N2 - We study a natural probabilistic model for motif discovery that has been used to experimentally test the effectiveness of motif discovery programs. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet ∑ A motif G = g 1g2 ⋯ gm is a string of m characters. Each background sequence is implanted a probabilistically generated approximate copy of G. For a probabilistically generated approximate copy b1b 2 ⋯ bm of G, every character is probabilistically generated such that the probability for bi ≠ gi is at most a. It has been conjectured that multiple background sequences can help with finding faint motifs G. In this paper, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet ∑ with |∑| ≥ 2 and is applicable to DNA motif discovery. We prove that for a ≤ 1/4 (1 - 1|∑|) and any constant x ≥ 8, there exist positive constants c0, ε, δ1 and δ2 such that if the length ρ of motif G is at least δ1 log n, and there are k ≥ c0 log n input sequences, then in O(n 2+kn) time this algorithm finds the motif with probability at least 1- 1/2x for every G ε ∑ρ-ψ ρ,h,ε(∑), where ρ is the length of the motif, h is a parameter with ρ ≥ 4h ≥ δ2 log n, and ψρ,h,ε(∑) is a small subset of at most 2- θ(ε2h) fraction of the sequences in ∑ρ. The constants c0, ε, δ1 and d2 do not depend on x when x is a parameter of order O(log n). Our algorithm can take any number k sequences as input.

AB - We study a natural probabilistic model for motif discovery that has been used to experimentally test the effectiveness of motif discovery programs. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet ∑ A motif G = g 1g2 ⋯ gm is a string of m characters. Each background sequence is implanted a probabilistically generated approximate copy of G. For a probabilistically generated approximate copy b1b 2 ⋯ bm of G, every character is probabilistically generated such that the probability for bi ≠ gi is at most a. It has been conjectured that multiple background sequences can help with finding faint motifs G. In this paper, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet ∑ with |∑| ≥ 2 and is applicable to DNA motif discovery. We prove that for a ≤ 1/4 (1 - 1|∑|) and any constant x ≥ 8, there exist positive constants c0, ε, δ1 and δ2 such that if the length ρ of motif G is at least δ1 log n, and there are k ≥ c0 log n input sequences, then in O(n 2+kn) time this algorithm finds the motif with probability at least 1- 1/2x for every G ε ∑ρ-ψ ρ,h,ε(∑), where ρ is the length of the motif, h is a parameter with ρ ≥ 4h ≥ δ2 log n, and ψρ,h,ε(∑) is a small subset of at most 2- θ(ε2h) fraction of the sequences in ∑ρ. The constants c0, ε, δ1 and d2 do not depend on x when x is a parameter of order O(log n). Our algorithm can take any number k sequences as input.

UR - http://www.scopus.com/inward/record.url?scp=67650233245&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=67650233245&partnerID=8YFLogxK

U2 - 10.1007/978-3-642-02017-9_26

DO - 10.1007/978-3-642-02017-9_26

M3 - Conference contribution

AN - SCOPUS:67650233245

SN - 9783642020162

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 231

EP - 240

BT - Theory and Applications of Models of Computation - 6th Annual Conference, TAMC 2009, Proceedings

ER -