A new two-phase sampling based algorithm for discovering association rules

Bin Chen*, Peter Haas, Peter I Scheuermann

*Corresponding author for this work

Research output: Contribution to conferencePaper

85 Citations (Scopus)

Abstract

This paper introduces FAST, a novel two-phase sampling-based algorithm for discovering association rules in large databases. In Phase I a large initial sample of transactions is collected and used to quickly and accurately estimate the support of each individual item in the database. In Phase II these estimated supports are used to either trim "outlier" transactions or select "representative" transactions from the initial sample, thereby forming a small final sample that more accurately reflects the statistical characteristics (i.e., itemset supports) of the entire database. The expensive operation of discovering association rules is then performed on the final sample. In an empirical study, FAST was able to achieve 90-95% accuracy using a final sample having a size of only 15-33% of that of a comparable random sample. This efficiency gain resulted in a speedup by roughly a factor of 10 over previous algorithms that require expensive processing of the entire database - even efficient algorithms that exploit sampling. Our new sampling technique can be used in conjunction with almost any standard association-rule algorithm, and can potentially render scalable other algorithms that mine "count" data.

Original languageEnglish (US)
Pages462-468
Number of pages7
StatePublished - Dec 1 2002
EventKDD - 2002 Proceedings of the Eight ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - Edmonton, Alta, Canada
Duration: Jul 23 2002Jul 26 2002

Other

OtherKDD - 2002 Proceedings of the Eight ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
CountryCanada
CityEdmonton, Alta
Period7/23/027/26/02

Fingerprint

Association rules
Sampling
Processing

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

Chen, B., Haas, P., & Scheuermann, P. I. (2002). A new two-phase sampling based algorithm for discovering association rules. 462-468. Paper presented at KDD - 2002 Proceedings of the Eight ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alta, Canada.
Chen, Bin ; Haas, Peter ; Scheuermann, Peter I. / A new two-phase sampling based algorithm for discovering association rules. Paper presented at KDD - 2002 Proceedings of the Eight ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alta, Canada.7 p.
@conference{8369f7d474c94fb4b47d01c29c9c64a8,
title = "A new two-phase sampling based algorithm for discovering association rules",
abstract = "This paper introduces FAST, a novel two-phase sampling-based algorithm for discovering association rules in large databases. In Phase I a large initial sample of transactions is collected and used to quickly and accurately estimate the support of each individual item in the database. In Phase II these estimated supports are used to either trim {"}outlier{"} transactions or select {"}representative{"} transactions from the initial sample, thereby forming a small final sample that more accurately reflects the statistical characteristics (i.e., itemset supports) of the entire database. The expensive operation of discovering association rules is then performed on the final sample. In an empirical study, FAST was able to achieve 90-95{\%} accuracy using a final sample having a size of only 15-33{\%} of that of a comparable random sample. This efficiency gain resulted in a speedup by roughly a factor of 10 over previous algorithms that require expensive processing of the entire database - even efficient algorithms that exploit sampling. Our new sampling technique can be used in conjunction with almost any standard association-rule algorithm, and can potentially render scalable other algorithms that mine {"}count{"} data.",
author = "Bin Chen and Peter Haas and Scheuermann, {Peter I}",
year = "2002",
month = "12",
day = "1",
language = "English (US)",
pages = "462--468",
note = "KDD - 2002 Proceedings of the Eight ACM SIGKDD International Conference on Knowledge Discovery and Data Mining ; Conference date: 23-07-2002 Through 26-07-2002",

}

Chen, B, Haas, P & Scheuermann, PI 2002, 'A new two-phase sampling based algorithm for discovering association rules' Paper presented at KDD - 2002 Proceedings of the Eight ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alta, Canada, 7/23/02 - 7/26/02, pp. 462-468.

A new two-phase sampling based algorithm for discovering association rules. / Chen, Bin; Haas, Peter; Scheuermann, Peter I.

2002. 462-468 Paper presented at KDD - 2002 Proceedings of the Eight ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alta, Canada.

Research output: Contribution to conferencePaper

TY - CONF

T1 - A new two-phase sampling based algorithm for discovering association rules

AU - Chen, Bin

AU - Haas, Peter

AU - Scheuermann, Peter I

PY - 2002/12/1

Y1 - 2002/12/1

N2 - This paper introduces FAST, a novel two-phase sampling-based algorithm for discovering association rules in large databases. In Phase I a large initial sample of transactions is collected and used to quickly and accurately estimate the support of each individual item in the database. In Phase II these estimated supports are used to either trim "outlier" transactions or select "representative" transactions from the initial sample, thereby forming a small final sample that more accurately reflects the statistical characteristics (i.e., itemset supports) of the entire database. The expensive operation of discovering association rules is then performed on the final sample. In an empirical study, FAST was able to achieve 90-95% accuracy using a final sample having a size of only 15-33% of that of a comparable random sample. This efficiency gain resulted in a speedup by roughly a factor of 10 over previous algorithms that require expensive processing of the entire database - even efficient algorithms that exploit sampling. Our new sampling technique can be used in conjunction with almost any standard association-rule algorithm, and can potentially render scalable other algorithms that mine "count" data.

AB - This paper introduces FAST, a novel two-phase sampling-based algorithm for discovering association rules in large databases. In Phase I a large initial sample of transactions is collected and used to quickly and accurately estimate the support of each individual item in the database. In Phase II these estimated supports are used to either trim "outlier" transactions or select "representative" transactions from the initial sample, thereby forming a small final sample that more accurately reflects the statistical characteristics (i.e., itemset supports) of the entire database. The expensive operation of discovering association rules is then performed on the final sample. In an empirical study, FAST was able to achieve 90-95% accuracy using a final sample having a size of only 15-33% of that of a comparable random sample. This efficiency gain resulted in a speedup by roughly a factor of 10 over previous algorithms that require expensive processing of the entire database - even efficient algorithms that exploit sampling. Our new sampling technique can be used in conjunction with almost any standard association-rule algorithm, and can potentially render scalable other algorithms that mine "count" data.

UR - http://www.scopus.com/inward/record.url?scp=0242625264&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0242625264&partnerID=8YFLogxK

M3 - Paper

SP - 462

EP - 468

ER -

Chen B, Haas P, Scheuermann PI. A new two-phase sampling based algorithm for discovering association rules. 2002. Paper presented at KDD - 2002 Proceedings of the Eight ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alta, Canada.