Efficient data reduction with EASE

Hervé Brönnimann*, Bin Chen, Manoranjan Dash, Peter Haas, Peter I Scheuermann

*Corresponding author for this work

Research output: Contribution to conference › Paper

42 Citations (Scopus)

Abstract

A variety of mining and analysis problems - ranging from association-rule discovery to contingency-table analysis to the materialization of certain approximate datacubes - involve the extraction of knowledge from a set of categorical count data. Such data can be viewed as a collection of "transactions," where a transaction is a fixed-length vector of counts. Classical algorithms for solving count-data problems require one or more computationally intensive passes over the entire database and can be prohibitively slow. One effective method for dealing with this ever-worsening scalability problem is to run the algorithms on a small sample of the data. We present a new data-reduction algorithm, called EASE, for producing such a sample. Like the FAST algorithm introduced by Chen et al., EASE is specifically designed for count-data applications. Both EASE and FAST take a relatively large initial random sample and then deterministically produce a subsample whose "distance" - appropriately defined - from the complete database is minimal. Unlike FAST, which obtains the final subsample by quasi-greedy descent, EASE uses epsilon-approximation methods to obtain the final subsample by a process of repeated halving. Experiments in the context of both association-rule mining and classical χ² contingency-table analysis show that EASE outperforms both FAST and simple random sampling, sometimes dramatically.
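The repeated-halving idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' EASE implementation: EASE selects each half using epsilon-approximation theory, whereas the toy criterion below simply keeps whichever random half of the current sample better preserves the item frequencies of the initial sample. All function names and the L1 "distance" choice here are illustrative assumptions.

```python
# Illustrative sketch of repeated-halving data reduction (NOT the authors'
# epsilon-approximation construction). Transactions are fixed-length count
# vectors; at each round we keep the random half whose per-item frequency
# vector is closer (in L1 distance) to that of the full initial sample.
import random

def item_frequencies(transactions, n_items):
    """Average count of each item across the given transactions."""
    freq = [0.0] * n_items
    for t in transactions:
        for i, c in enumerate(t):
            freq[i] += c
    total = len(transactions) or 1
    return [f / total for f in freq]

def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def halve_sample(sample, target, n_items, seed=0):
    """Repeatedly halve `sample` until its size is at most `target`,
    each round keeping the half that better matches the reference
    frequencies of the full starting sample."""
    rng = random.Random(seed)
    reference = item_frequencies(sample, n_items)
    while len(sample) > target:
        shuffled = sample[:]
        rng.shuffle(shuffled)
        mid = len(shuffled) // 2
        halves = [shuffled[:mid], shuffled[mid:]]
        sample = min(
            halves,
            key=lambda h: l1_distance(item_frequencies(h, n_items), reference),
        )
    return sample
```

Each halving step discards half the data, so reaching a target of n/2^k items takes k rounds; the paper's contribution is making each kept half provably close to the full database rather than relying on this simple greedy comparison.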

Original language: English (US)
Pages: 59-68
Number of pages: 10
DOI: 10.1145/956750.956761
State: Published - Dec 1 2003
Event: 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03 - Washington, DC, United States
Duration: Aug 24 2003 - Aug 27 2003

Keywords

  • Association rules
  • Count dataset
  • Data streams
  • Frequency estimation
  • OLAP
  • Sampling

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

Brönnimann, H., Chen, B., Dash, M., Haas, P., & Scheuermann, P. I. (2003). Efficient data reduction with EASE. 59-68. Paper presented at 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03, Washington, DC, United States. https://doi.org/10.1145/956750.956761