Probabilistic analysis of a m otif discovery algorithm for multiple sequences

Bin Fui*, Ming-Yang Kao, Lusheng Wang

*Corresponding author for this work

Research output: Contribution to journalArticle

2 Citations (Scopus)

Abstract

We study a natural probabilistic model for motif discovery that has been used to experimentally test the quality of motif discovery programs. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet σ. A motif G = g1g2 · · · gm is a string of m characters. Each background sequence is implanted into a probabilistically generated approximate copy of G. For an approximate copy b1b2 · · · bm of G, every character bi is probabilistically generated such that the probability for bi ≠ gi is at most α. In this paper, we give the first analytical proof that multiple background sequences do help with finding subtle and faint motifs. This work is a theoretical approach with a rigorous probabilistic analysis. We develop an algorithm that under the probabilistic model can find the implanted motif with high probability when the number of background sequences is reasonably large. Specifically, we prove that for α < 0.1771 and any constant x ≥ 8, there exist constants t 001 > 0 such that if the length of the motif is at least δ0 logn, the alphabet has at least t0 characters, and there are at least δ1 log n0 input sequences, then in O(n3) time our algorithm finds the motif with probability at least 1-1/2x, where n is the longest length of any input sequence and n0 ≤ n is an upper bound for the length of the motif.

Original languageEnglish (US)
Pages (from-to)1715-1737
Number of pages23
JournalSIAM Journal on Discrete Mathematics
Volume23
Issue number4
DOIs
StatePublished - Dec 1 2009

Fingerprint

Probabilistic Analysis
Motif Discovery
Probabilistic Model
Strings
Character
Background
Upper bound

Keywords

  • Motif
  • Multiple sequences
  • Probabilistic analysis

ASJC Scopus subject areas

  • Mathematics(all)

Cite this

@article{e99a2ae1965245ee85632dee9232fda1,
title = "Probabilistic analysis of a m otif discovery algorithm for multiple sequences",
abstract = "We study a natural probabilistic model for motif discovery that has been used to experimentally test the quality of motif discovery programs. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet σ. A motif G = g1g2 · · · gm is a string of m characters. Each background sequence is implanted into a probabilistically generated approximate copy of G. For an approximate copy b1b2 · · · bm of G, every character bi is probabilistically generated such that the probability for bi ≠ gi is at most α. In this paper, we give the first analytical proof that multiple background sequences do help with finding subtle and faint motifs. This work is a theoretical approach with a rigorous probabilistic analysis. We develop an algorithm that under the probabilistic model can find the implanted motif with high probability when the number of background sequences is reasonably large. Specifically, we prove that for α < 0.1771 and any constant x ≥ 8, there exist constants t 0,δ0,δ1 > 0 such that if the length of the motif is at least δ0 logn, the alphabet has at least t0 characters, and there are at least δ1 log n0 input sequences, then in O(n3) time our algorithm finds the motif with probability at least 1-1/2x, where n is the longest length of any input sequence and n0 ≤ n is an upper bound for the length of the motif.",
keywords = "Motif, Multiple sequences, Probabilistic analysis",
author = "Bin Fui and Ming-Yang Kao and Lusheng Wang",
year = "2009",
month = "12",
day = "1",
doi = "10.1137/080720401",
language = "English (US)",
volume = "23",
pages = "1715--1737",
journal = "SIAM Journal on Discrete Mathematics",
issn = "0895-4801",
publisher = "Society for Industrial and Applied Mathematics Publications",
number = "4",

}

Probabilistic analysis of a m otif discovery algorithm for multiple sequences. / Fui, Bin; Kao, Ming-Yang; Wang, Lusheng.

In: SIAM Journal on Discrete Mathematics, Vol. 23, No. 4, 01.12.2009, p. 1715-1737.

Research output: Contribution to journalArticle

TY - JOUR

T1 - Probabilistic analysis of a m otif discovery algorithm for multiple sequences

AU - Fui, Bin

AU - Kao, Ming-Yang

AU - Wang, Lusheng

PY - 2009/12/1

Y1 - 2009/12/1

N2 - We study a natural probabilistic model for motif discovery that has been used to experimentally test the quality of motif discovery programs. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet σ. A motif G = g1g2 · · · gm is a string of m characters. Each background sequence is implanted into a probabilistically generated approximate copy of G. For an approximate copy b1b2 · · · bm of G, every character bi is probabilistically generated such that the probability for bi ≠ gi is at most α. In this paper, we give the first analytical proof that multiple background sequences do help with finding subtle and faint motifs. This work is a theoretical approach with a rigorous probabilistic analysis. We develop an algorithm that under the probabilistic model can find the implanted motif with high probability when the number of background sequences is reasonably large. Specifically, we prove that for α < 0.1771 and any constant x ≥ 8, there exist constants t 0,δ0,δ1 > 0 such that if the length of the motif is at least δ0 logn, the alphabet has at least t0 characters, and there are at least δ1 log n0 input sequences, then in O(n3) time our algorithm finds the motif with probability at least 1-1/2x, where n is the longest length of any input sequence and n0 ≤ n is an upper bound for the length of the motif.

AB - We study a natural probabilistic model for motif discovery that has been used to experimentally test the quality of motif discovery programs. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet σ. A motif G = g1g2 · · · gm is a string of m characters. Each background sequence is implanted into a probabilistically generated approximate copy of G. For an approximate copy b1b2 · · · bm of G, every character bi is probabilistically generated such that the probability for bi ≠ gi is at most α. In this paper, we give the first analytical proof that multiple background sequences do help with finding subtle and faint motifs. This work is a theoretical approach with a rigorous probabilistic analysis. We develop an algorithm that under the probabilistic model can find the implanted motif with high probability when the number of background sequences is reasonably large. Specifically, we prove that for α < 0.1771 and any constant x ≥ 8, there exist constants t 0,δ0,δ1 > 0 such that if the length of the motif is at least δ0 logn, the alphabet has at least t0 characters, and there are at least δ1 log n0 input sequences, then in O(n3) time our algorithm finds the motif with probability at least 1-1/2x, where n is the longest length of any input sequence and n0 ≤ n is an upper bound for the length of the motif.

KW - Motif

KW - Multiple sequences

KW - Probabilistic analysis

UR - http://www.scopus.com/inward/record.url?scp=77950845583&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77950845583&partnerID=8YFLogxK

U2 - 10.1137/080720401

DO - 10.1137/080720401

M3 - Article

AN - SCOPUS:77950845583

VL - 23

SP - 1715

EP - 1737

JO - SIAM Journal on Discrete Mathematics

JF - SIAM Journal on Discrete Mathematics

SN - 0895-4801

IS - 4

ER -