### Abstract

We study a natural probabilistic model for motif discovery. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet ∑. A motif G = g_{1}g _{2} ⋯ g_{m} is a string of m characters. Each background sequence is implanted with a probabilistically generated approximate copy of G. For a probabilistically generated approximate copy b_{1}b_{2} ⋯ bm of G, every character is probabilistically generated such that the probability for b_{i} ≠ g_{i} is at most α. In this article, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet ∑ with |∑| ≥ 2 and is applicable to DNA motif discovery. We prove that for α < 1/8 (1 - 1/|∑| ), there exist positive constants c_{0}, ε, and δ_{2} such that if there are at least c_{0} logn input sequences, then in O( n^{2}/h (log n)^{O(1)}) time this algorithm finds the motif with probability at least 3/4 for every G ε ∑^{p} - ψ_{p},h,ε (∑), where n the length of longest sequences, p is the length of the motif, h is a parameter with p ≥ 4h ≥ δ_{2} log n, and εp,h,ε (∑) is a small subset of at most 2-Θ(ε^{2}h) fraction of the sequences in ∑^{p} .

Original language | English (US) |
---|---|

Article number | 26 |

Journal | ACM Transactions on Algorithms |

Volume | 7 |

Issue number | 2 |

DOIs | |

State | Published - Mar 1 2011 |

### Fingerprint

### Keywords

- Complexity
- Motif detection
- Probabilistic analysis
- Probability

### ASJC Scopus subject areas

- Mathematics (miscellaneous)

### Cite this

*ACM Transactions on Algorithms*,

*7*(2), [26]. https://doi.org/10.1145/1921659.1921672