### Abstract

A variety of mining and analysis problems - ranging from association-rule discovery to contingency table analysis to materialization of certain approximate datacubes - involve the extraction of knowledge from a set of categorical count data. Such data can be viewed as a collection of "transactions," where a transaction is a fixed-length vector of counts. Classical algorithms for solving count-data problems require one or more computationally intensive passes over the entire database and can be prohibitively slow. One effective method for dealing with this ever-worsening scalability problem is to run the algorithms on a small sample of the data. We present a new data-reduction algorithm, called EASE, for producing such a sample. Like the FAST algorithm introduced by Chen et al., EASE is especially designed for count data applications. Both EASE and FAST take a relatively large initial random sample and then deterministically produce a subsample whose "distance" - appropriately defined - from the complete database is minimal. Unlike FAST, which obtains the final subsample by quasi-greedy descent, EASE uses epsilon-approximation methods to obtain the final subsample by a process of repeated halving. Experiments both in the context of association rule mining and classical X^{2} contingency-table analysis show that EASE outperforms both FAST and simple random sampling, sometimes dramatically.

Original language | English (US) |
---|---|

Pages | 59-68 |

Number of pages | 10 |

DOIs | |

State | Published - Dec 1 2003 |

Event | 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03 - Washington, DC, United States Duration: Aug 24 2003 → Aug 27 2003 |

### Other

Other | 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03 |
---|---|

Country | United States |

City | Washington, DC |

Period | 8/24/03 → 8/27/03 |

### Fingerprint

### Keywords

- Association rules
- Count dataset
- Data streams
- Frequency estimation
- OLAP
- Sampling

### ASJC Scopus subject areas

- Software
- Information Systems

### Cite this

*Efficient data reduction with EASE*. 59-68. Paper presented at 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03, Washington, DC, United States. https://doi.org/10.1145/956750.956761

}

**Efficient data reduction with EASE.** / Brönnimann, Hervé; Chen, Bin; Dash, Manoranjan; Haas, Peter; Scheuermann, Peter I.

Research output: Contribution to conference › Paper

TY - CONF

T1 - Efficient data reduction with EASE

AU - Brönnimann, Hervé

AU - Chen, Bin

AU - Dash, Manoranjan

AU - Haas, Peter

AU - Scheuermann, Peter I

PY - 2003/12/1

Y1 - 2003/12/1

N2 - A variety of mining and analysis problems - ranging from association-rule discovery to contingency table analysis to materialization of certain approximate datacubes - involve the extraction of knowledge from a set of categorical count data. Such data can be viewed as a collection of "transactions," where a transaction is a fixed-length vector of counts. Classical algorithms for solving count-data problems require one or more computationally intensive passes over the entire database and can be prohibitively slow. One effective method for dealing with this ever-worsening scalability problem is to run the algorithms on a small sample of the data. We present a new data-reduction algorithm, called EASE, for producing such a sample. Like the FAST algorithm introduced by Chen et al., EASE is especially designed for count data applications. Both EASE and FAST take a relatively large initial random sample and then deterministically produce a subsample whose "distance" - appropriately defined - from the complete database is minimal. Unlike FAST, which obtains the final subsample by quasi-greedy descent, EASE uses epsilon-approximation methods to obtain the final subsample by a process of repeated halving. Experiments both in the context of association rule mining and classical X2 contingency-table analysis show that EASE outperforms both FAST and simple random sampling, sometimes dramatically.

AB - A variety of mining and analysis problems - ranging from association-rule discovery to contingency table analysis to materialization of certain approximate datacubes - involve the extraction of knowledge from a set of categorical count data. Such data can be viewed as a collection of "transactions," where a transaction is a fixed-length vector of counts. Classical algorithms for solving count-data problems require one or more computationally intensive passes over the entire database and can be prohibitively slow. One effective method for dealing with this ever-worsening scalability problem is to run the algorithms on a small sample of the data. We present a new data-reduction algorithm, called EASE, for producing such a sample. Like the FAST algorithm introduced by Chen et al., EASE is especially designed for count data applications. Both EASE and FAST take a relatively large initial random sample and then deterministically produce a subsample whose "distance" - appropriately defined - from the complete database is minimal. Unlike FAST, which obtains the final subsample by quasi-greedy descent, EASE uses epsilon-approximation methods to obtain the final subsample by a process of repeated halving. Experiments both in the context of association rule mining and classical X2 contingency-table analysis show that EASE outperforms both FAST and simple random sampling, sometimes dramatically.

KW - Association rules

KW - Count dataset

KW - Data streams

KW - Frequency estimation

KW - OLAP

KW - Sampling

UR - http://www.scopus.com/inward/record.url?scp=26944447201&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=26944447201&partnerID=8YFLogxK

U2 - 10.1145/956750.956761

DO - 10.1145/956750.956761

M3 - Paper

SP - 59

EP - 68

ER -