Discovering almost any hidden motif from multiple sequences

Bin Fu*, Ming-Yang Kao, Lusheng Wang

*Corresponding author for this work

Research output: Contribution to journalArticle

1 Citation (Scopus)

Abstract

We study a natural probabilistic model for motif discovery. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet ∑. A motif G = g1g 2 ⋯ gm is a string of m characters. Each background sequence is implanted with a probabilistically generated approximate copy of G. For a probabilistically generated approximate copy b1b2 ⋯ bm of G, every character is probabilistically generated such that the probability for bi ≠ gi is at most α. In this article, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet ∑ with |∑| ≥ 2 and is applicable to DNA motif discovery. We prove that for α < 1/8 (1 - 1/|∑| ), there exist positive constants c0, ε, and δ2 such that if there are at least c0 logn input sequences, then in O( n2/h (log n)O(1)) time this algorithm finds the motif with probability at least 3/4 for every G ε ∑p - ψp,h,ε (∑), where n the length of longest sequences, p is the length of the motif, h is a parameter with p ≥ 4h ≥ δ2 log n, and εp,h,ε (∑) is a small subset of at most 2-Θ(ε2h) fraction of the sequences in ∑p .

Original languageEnglish (US)
Article number26
JournalACM Transactions on Algorithms
Volume7
Issue number2
DOIs
StatePublished - Mar 1 2011

Fingerprint

Motif Discovery
Probabilistic Model
Efficient Algorithms
Strings
Subset
Character
Background
Model

Keywords

  • Complexity
  • Motif detection
  • Probabilistic analysis
  • Probability

ASJC Scopus subject areas

  • Mathematics (miscellaneous)

Cite this

@article{8b1811c87ecd4439ac4ce974fccdce07,
title = "Discovering almost any hidden motif from multiple sequences",
abstract = "We study a natural probabilistic model for motif discovery. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet ∑. A motif G = g1g 2 ⋯ gm is a string of m characters. Each background sequence is implanted with a probabilistically generated approximate copy of G. For a probabilistically generated approximate copy b1b2 ⋯ bm of G, every character is probabilistically generated such that the probability for bi ≠ gi is at most α. In this article, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet ∑ with |∑| ≥ 2 and is applicable to DNA motif discovery. We prove that for α < 1/8 (1 - 1/|∑| ), there exist positive constants c0, ε, and δ2 such that if there are at least c0 logn input sequences, then in O( n2/h (log n)O(1)) time this algorithm finds the motif with probability at least 3/4 for every G ε ∑p - ψp,h,ε (∑), where n the length of longest sequences, p is the length of the motif, h is a parameter with p ≥ 4h ≥ δ2 log n, and εp,h,ε (∑) is a small subset of at most 2-Θ(ε2h) fraction of the sequences in ∑p .",
keywords = "Complexity, Motif detection, Probabilistic analysis, Probability",
author = "Bin Fu and Ming-Yang Kao and Lusheng Wang",
year = "2011",
month = "3",
day = "1",
doi = "10.1145/1921659.1921672",
language = "English (US)",
volume = "7",
journal = "ACM Transactions on Algorithms",
issn = "1549-6325",
publisher = "Association for Computing Machinery (ACM)",
number = "2",

}

Discovering almost any hidden motif from multiple sequences. / Fu, Bin; Kao, Ming-Yang; Wang, Lusheng.

In: ACM Transactions on Algorithms, Vol. 7, No. 2, 26, 01.03.2011.

Research output: Contribution to journalArticle

TY - JOUR

T1 - Discovering almost any hidden motif from multiple sequences

AU - Fu, Bin

AU - Kao, Ming-Yang

AU - Wang, Lusheng

PY - 2011/3/1

Y1 - 2011/3/1

N2 - We study a natural probabilistic model for motif discovery. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet ∑. A motif G = g1g 2 ⋯ gm is a string of m characters. Each background sequence is implanted with a probabilistically generated approximate copy of G. For a probabilistically generated approximate copy b1b2 ⋯ bm of G, every character is probabilistically generated such that the probability for bi ≠ gi is at most α. In this article, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet ∑ with |∑| ≥ 2 and is applicable to DNA motif discovery. We prove that for α < 1/8 (1 - 1/|∑| ), there exist positive constants c0, ε, and δ2 such that if there are at least c0 logn input sequences, then in O( n2/h (log n)O(1)) time this algorithm finds the motif with probability at least 3/4 for every G ε ∑p - ψp,h,ε (∑), where n the length of longest sequences, p is the length of the motif, h is a parameter with p ≥ 4h ≥ δ2 log n, and εp,h,ε (∑) is a small subset of at most 2-Θ(ε2h) fraction of the sequences in ∑p .

AB - We study a natural probabilistic model for motif discovery. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet ∑. A motif G = g1g 2 ⋯ gm is a string of m characters. Each background sequence is implanted with a probabilistically generated approximate copy of G. For a probabilistically generated approximate copy b1b2 ⋯ bm of G, every character is probabilistically generated such that the probability for bi ≠ gi is at most α. In this article, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet ∑ with |∑| ≥ 2 and is applicable to DNA motif discovery. We prove that for α < 1/8 (1 - 1/|∑| ), there exist positive constants c0, ε, and δ2 such that if there are at least c0 logn input sequences, then in O( n2/h (log n)O(1)) time this algorithm finds the motif with probability at least 3/4 for every G ε ∑p - ψp,h,ε (∑), where n the length of longest sequences, p is the length of the motif, h is a parameter with p ≥ 4h ≥ δ2 log n, and εp,h,ε (∑) is a small subset of at most 2-Θ(ε2h) fraction of the sequences in ∑p .

KW - Complexity

KW - Motif detection

KW - Probabilistic analysis

KW - Probability

UR - http://www.scopus.com/inward/record.url?scp=79953237098&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79953237098&partnerID=8YFLogxK

U2 - 10.1145/1921659.1921672

DO - 10.1145/1921659.1921672

M3 - Article

AN - SCOPUS:79953237098

VL - 7

JO - ACM Transactions on Algorithms

JF - ACM Transactions on Algorithms

SN - 1549-6325

IS - 2

M1 - 26

ER -