Efficient parallel classification using dimensional aggregates

Sanjay Goil, Alok Choudhary

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Multidimensional aggregates are frequently computed to improve query performance in Online Analytical Processing applications. We present a new method for decision-tree-based classification that uses the aggregates computed in the multidimensional data model. The structure imposed on data in an explicit multidimensional storage mechanism leads to efficient dimensional operations. Decision-tree-based classification algorithms perform computations to find the best split point at each node of the tree. The split can be computed efficiently from the one-dimensional aggregates if the cell values are the class-id values and counts are maintained for each class. This is used repeatedly at the nodes of the decision tree to calculate splits and manage data. Previous parallel approaches for decision-tree-based classification use sorted attribute lists and hash tables to compute the split point and split the data appropriately. In the worst case, the amount of data they communicate at each level of the tree is proportional to the product of the number of records in the training set and the number of dimensions. The parallel formulation of our approach communicates data proportional to the product of the sum of the cardinalities of all dimensions and the number of non-classified nodes at each level of the tree. Communication volume is greatly reduced in our approach, and by coalescing messages it is done in one phase of communication at each level of the tree. Preliminary results from our experiments on a coarse-grained, distributed-memory parallel machine (IBM SP2) show good performance.
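As an illustration of the idea in the abstract, the following minimal sketch picks a split point for one ordered dimension using only its one-dimensional aggregate of per-class counts. It is not the authors' implementation. The Gini index is an assumed split criterion (the abstract does not name one), and the names `gini`, `best_split`, and `merge_aggregates` are hypothetical. `merge_aggregates` stands in for the single communication phase: each processor contributes a table whose size depends on the dimension's cardinality and the number of classes, not on the number of records.

```python
from typing import List, Sequence, Tuple


def gini(counts: Sequence[int]) -> float:
    """Gini impurity of a class-count vector; 0.0 for an empty partition."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def merge_aggregates(parts: List[List[List[int]]]) -> List[List[int]]:
    """Element-wise sum of per-processor aggregates.

    Hypothetical stand-in for a parallel reduction; every part must have
    the same shape: parts[p][v][k] = count on processor p of records with
    dimension value v and class k.
    """
    merged = [[0] * len(parts[0][0]) for _ in parts[0]]
    for part in parts:
        for v, row in enumerate(part):
            for k, count in enumerate(row):
                merged[v][k] += count
    return merged


def best_split(aggregate: List[List[int]]) -> Tuple[int, float]:
    """Return (split_index, weighted_gini) for one ordered dimension.

    Split index i sends values 0..i to the left child and the rest to
    the right child. Only the class counts are read, never the records.
    """
    num_classes = len(aggregate[0])
    total = [sum(row[k] for row in aggregate) for k in range(num_classes)]
    n = sum(total)
    left = [0] * num_classes
    best_index, best_score = -1, float("inf")
    for i in range(len(aggregate) - 1):
        # Running prefix sum of class counts gives the left partition.
        left = [l + c for l, c in zip(left, aggregate[i])]
        right = [t - l for t, l in zip(total, left)]
        score = (sum(left) * gini(left) + sum(right) * gini(right)) / n
        if score < best_score:
            best_index, best_score = i, score
    return best_index, best_score


# Two processors, one dimension with 4 values, 2 classes (made-up counts).
local_a = [[2, 0], [1, 1], [0, 3], [0, 1]]
local_b = [[2, 0], [2, 0], [0, 2], [0, 1]]
global_aggregate = merge_aggregates([local_a, local_b])
split_index, split_score = best_split(global_aggregate)
```

In this made-up example the merged table has 4 x 2 entries regardless of how many training records each processor holds, which is the property the abstract relies on to reduce communication volume.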

Original language: English (US)
Title of host publication: Large-Scale Parallel Data Mining
Editors: Ching-Tien Ho, Mohammed J. Zaki
Publisher: Springer Verlag
Pages: 197-210
Number of pages: 14
ISBN (Print): 3540671943, 9783540671947
State: Published - 2002
Externally published: Yes
Event: 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999 - San Diego, United States
Duration: Aug 15 1999 → Aug 15 1999

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 1759
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999
Country/Territory: United States
City: San Diego
Period: 8/15/99 → 8/15/99

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
