Efficient data reduction with EASE

Hervé Brönnimann*, Bin Chen, Manoranjan Dash, Peter Haas, Peter Scheuermann

*Corresponding author for this work

Research output: Contribution to conferencePaperpeer-review

44 Scopus citations

Abstract

A variety of mining and analysis problems - ranging from association-rule discovery to contingency table analysis to materialization of certain approximate datacubes - involve the extraction of knowledge from a set of categorical count data. Such data can be viewed as a collection of "transactions," where a transaction is a fixed-length vector of counts. Classical algorithms for solving count-data problems require one or more computationally intensive passes over the entire database and can be prohibitively slow. One effective method for dealing with this ever-worsening scalability problem is to run the algorithms on a small sample of the data. We present a new data-reduction algorithm, called EASE, for producing such a sample. Like the FAST algorithm introduced by Chen et al., EASE is especially designed for count data applications. Both EASE and FAST take a relatively large initial random sample and then deterministically produce a subsample whose "distance" - appropriately defined - from the complete database is minimal. Unlike FAST, which obtains the final subsample by quasi-greedy descent, EASE uses epsilon-approximation methods to obtain the final subsample by a process of repeated halving. Experiments both in the context of association rule mining and classical X2 contingency-table analysis show that EASE outperforms both FAST and simple random sampling, sometimes dramatically.

Original languageEnglish (US)
Pages59-68
Number of pages10
DOIs
StatePublished - Dec 1 2003
Event9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03 - Washington, DC, United States
Duration: Aug 24 2003Aug 27 2003

Other

Other9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03
CountryUnited States
CityWashington, DC
Period8/24/038/27/03

Keywords

  • Association rules
  • Count dataset
  • Data streams
  • Frequency estimation
  • OLAP
  • Sampling

ASJC Scopus subject areas

  • Software
  • Information Systems

Fingerprint Dive into the research topics of 'Efficient data reduction with EASE'. Together they form a unique fingerprint.

Cite this