A new two-phase sampling based algorithm for discovering association rules

Bin Chen*, Peter Haas, Peter I Scheuermann

*Corresponding author for this work

Research output: Contribution to conferencePaperpeer-review

98 Scopus citations

Abstract

This paper introduces FAST, a novel two-phase sampling-based algorithm for discovering association rules in large databases. In Phase I a large initial sample of transactions is collected and used to quickly and accurately estimate the support of each individual item in the database. In Phase II these estimated supports are used to either trim "outlier" transactions or select "representative" transactions from the initial sample, thereby forming a small final sample that more accurately reflects the statistical characteristics (i.e., itemset supports) of the entire database. The expensive operation of discovering association rules is then performed on the final sample. In an empirical study, FAST was able to achieve 90-95% accuracy using a final sample having a size of only 15-33% of that of a comparable random sample. This efficiency gain resulted in a speedup by roughly a factor of 10 over previous algorithms that require expensive processing of the entire database - even efficient algorithms that exploit sampling. Our new sampling technique can be used in conjunction with almost any standard association-rule algorithm, and can potentially render scalable other algorithms that mine "count" data.

Original languageEnglish (US)
Pages462-468
Number of pages7
StatePublished - Dec 1 2002
EventKDD - 2002 Proceedings of the Eight ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - Edmonton, Alta, Canada
Duration: Jul 23 2002Jul 26 2002

Other

OtherKDD - 2002 Proceedings of the Eight ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Country/TerritoryCanada
CityEdmonton, Alta
Period7/23/027/26/02

ASJC Scopus subject areas

  • Software
  • Information Systems

Fingerprint

Dive into the research topics of 'A new two-phase sampling based algorithm for discovering association rules'. Together they form a unique fingerprint.

Cite this