Classical data mining algorithms require one or more computationally intensive passes over the entire database and can be prohibitively slow. One effective method for dealing with this ever-worsening scalability problem is to run the algorithms on a small sample of the data. We present and empirically compare two data-reduction algorithms for producing such a sample; these algorithms, called FAST and EA, are tailored to “count” data applications such as association-rule mining. The algorithms are similar in that both attempt to produce a sample whose “distance”—appropriately defined—from the complete database is minimal. They differ greatly, however, in the way they greedily search through the exponential number of possible samples. FAST, recently developed by Chen et al., uses random sampling together with trimming of “outlier” transactions. On the other hand, the EA algorithm, introduced in this chapter, repeatedly and deterministically halves the data to obtain the final sample. Unlike FAST, the EA algorithm provides a guaranteed level of accuracy. Our experiments show that EA is more expensive to run than FAST, but yields more accurate results for a given sample size. Thus, depending on the specific problem under consideration, the user can trade off speed and accuracy by selecting the appropriate method. We conclude by showing how the EA data-reduction approach can potentially be adapted to provide data-reduction schemes for streaming data systems. The proposed schemes favor recent data while still retaining partial information about all of the data seen so far.
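As a rough illustration of the halving idea described above (this is a simplified sketch, not the chapter's actual EA algorithm, which rests on epsilon-approximation theory), one can repeatedly split the transactions into two halves and keep the half whose item frequencies lie closer to those of the full database. The function names `halving_sample` and `item_freqs`, the L1 frequency distance, and the restriction to single-item supports are all simplifying assumptions made here for illustration:

```python
from collections import Counter

def item_freqs(transactions):
    """Relative frequency (support) of each item across a list of transactions."""
    counts = Counter()
    for t in transactions:
        counts.update(t)
    n = max(len(transactions), 1)
    return {item: c / n for item, c in counts.items()}

def l1_distance(f, g):
    """L1 distance between two item-frequency maps over their joint item universe."""
    items = set(f) | set(g)
    return sum(abs(f.get(i, 0.0) - g.get(i, 0.0)) for i in items)

def halving_sample(transactions, target_size):
    """Repeatedly and deterministically halve the data, keeping the half whose
    item frequencies are closer (in L1 distance) to the full database's."""
    full = item_freqs(transactions)
    sample = list(transactions)
    while len(sample) // 2 >= target_size:
        mid = len(sample) // 2
        a, b = sample[:mid], sample[mid:]  # deterministic split
        if l1_distance(item_freqs(a), full) <= l1_distance(item_freqs(b), full):
            sample = a
        else:
            sample = b
    return sample
```

The real EA algorithm chooses its halves so that the discrepancy over itemsets is provably bounded, which is what yields its guaranteed accuracy; the greedy L1 criterion above only mimics the "keep the closer half" structure.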
| Journal | Data Mining: Next Generation Challenges and Future Directions |
| State | Published - Oct 2004 |