AGORAS: A fast algorithm for estimating medoids in large datasets

Esteban M. Rangel, William Hendrix, Ankit Agrawal, Wei-Keng Liao, Alok Nidhi Choudhary

Research output: Contribution to journalConference article

4 Citations (Scopus)

Abstract

The k-medoids methods for modeling clustered data have many desirable properties such as robustness to noise and the ability to use non-numerical values, however, they are typically not applied to large datasets due to their associated computational complexity. In this paper, we present AGORAS, a novel heuristic algorithm for the k-medoids problem where the algorithmic complexity is driven by, k, the number of clusters, rather than, n, the number of data points. Our algorithm attempts to isolate a sample from each individual cluster within a sequence of uniformly drawn samples taken from the complete data. As a result, computing the k-medoids solution using our method only involves solving k trivial sub-problems of centrality. This allows our algorithm to run in comparable time for arbitrarily large datasets with same underlying density distribution. We evaluate AGORAS experimentally against PAM and CLARANS - two of the best-known existing algorithms for the k-medoids problem - across a variety of published and synthetic datasets. We find that AGORAS outperforms PAM by up to four orders of magnitude for data sets with less than 10,000 points, and it outperforms CLARANS by two orders of magnitude on a dataset of just 64,000 points. Moreover, we find in some cases that AGORAS also outperforms in terms of cluster quality.

Original languageEnglish (US)
Pages (from-to)1159-1169
Number of pages11
JournalProcedia Computer Science
Volume80
DOIs
StatePublished - Jan 1 2016
EventInternational Conference on Computational Science, ICCS 2016 - San Diego, United States
Duration: Jun 6 2016Jun 8 2016

Fingerprint

Pulse amplitude modulation
Heuristic algorithms
Data structures
Computational complexity

Keywords

  • Cluster analysis
  • K-medoids
  • Partitional clustering

ASJC Scopus subject areas

  • Computer Science(all)

Cite this

@article{f83edb7eca5c47ddbd66d90aed35dd4e,
title = "AGORAS: A fast algorithm for estimating medoids in large datasets",
abstract = "The k-medoids methods for modeling clustered data have many desirable properties such as robustness to noise and the ability to use non-numerical values, however, they are typically not applied to large datasets due to their associated computational complexity. In this paper, we present AGORAS, a novel heuristic algorithm for the k-medoids problem where the algorithmic complexity is driven by, k, the number of clusters, rather than, n, the number of data points. Our algorithm attempts to isolate a sample from each individual cluster within a sequence of uniformly drawn samples taken from the complete data. As a result, computing the k-medoids solution using our method only involves solving k trivial sub-problems of centrality. This allows our algorithm to run in comparable time for arbitrarily large datasets with same underlying density distribution. We evaluate AGORAS experimentally against PAM and CLARANS - two of the best-known existing algorithms for the k-medoids problem - across a variety of published and synthetic datasets. We find that AGORAS outperforms PAM by up to four orders of magnitude for data sets with less than 10,000 points, and it outperforms CLARANS by two orders of magnitude on a dataset of just 64,000 points. Moreover, we find in some cases that AGORAS also outperforms in terms of cluster quality.",
keywords = "Cluster analysis, K-medoids, Partitional clustering",
author = "Rangel, {Esteban M.} and William Hendrix and Ankit Agrawal and Wei-Keng Liao and Choudhary, {Alok Nidhi}",
year = "2016",
month = "1",
day = "1",
doi = "10.1016/j.procs.2016.05.446",
language = "English (US)",
volume = "80",
pages = "1159--1169",
journal = "Procedia Computer Science",
issn = "1877-0509",
publisher = "Elsevier BV",

}

AGORAS : A fast algorithm for estimating medoids in large datasets. / Rangel, Esteban M.; Hendrix, William; Agrawal, Ankit; Liao, Wei-Keng; Choudhary, Alok Nidhi.

In: Procedia Computer Science, Vol. 80, 01.01.2016, p. 1159-1169.

Research output: Contribution to journalConference article

TY - JOUR

T1 - AGORAS

T2 - A fast algorithm for estimating medoids in large datasets

AU - Rangel, Esteban M.

AU - Hendrix, William

AU - Agrawal, Ankit

AU - Liao, Wei-Keng

AU - Choudhary, Alok Nidhi

PY - 2016/1/1

Y1 - 2016/1/1

N2 - The k-medoids methods for modeling clustered data have many desirable properties such as robustness to noise and the ability to use non-numerical values, however, they are typically not applied to large datasets due to their associated computational complexity. In this paper, we present AGORAS, a novel heuristic algorithm for the k-medoids problem where the algorithmic complexity is driven by, k, the number of clusters, rather than, n, the number of data points. Our algorithm attempts to isolate a sample from each individual cluster within a sequence of uniformly drawn samples taken from the complete data. As a result, computing the k-medoids solution using our method only involves solving k trivial sub-problems of centrality. This allows our algorithm to run in comparable time for arbitrarily large datasets with same underlying density distribution. We evaluate AGORAS experimentally against PAM and CLARANS - two of the best-known existing algorithms for the k-medoids problem - across a variety of published and synthetic datasets. We find that AGORAS outperforms PAM by up to four orders of magnitude for data sets with less than 10,000 points, and it outperforms CLARANS by two orders of magnitude on a dataset of just 64,000 points. Moreover, we find in some cases that AGORAS also outperforms in terms of cluster quality.

AB - The k-medoids methods for modeling clustered data have many desirable properties such as robustness to noise and the ability to use non-numerical values, however, they are typically not applied to large datasets due to their associated computational complexity. In this paper, we present AGORAS, a novel heuristic algorithm for the k-medoids problem where the algorithmic complexity is driven by, k, the number of clusters, rather than, n, the number of data points. Our algorithm attempts to isolate a sample from each individual cluster within a sequence of uniformly drawn samples taken from the complete data. As a result, computing the k-medoids solution using our method only involves solving k trivial sub-problems of centrality. This allows our algorithm to run in comparable time for arbitrarily large datasets with same underlying density distribution. We evaluate AGORAS experimentally against PAM and CLARANS - two of the best-known existing algorithms for the k-medoids problem - across a variety of published and synthetic datasets. We find that AGORAS outperforms PAM by up to four orders of magnitude for data sets with less than 10,000 points, and it outperforms CLARANS by two orders of magnitude on a dataset of just 64,000 points. Moreover, we find in some cases that AGORAS also outperforms in terms of cluster quality.

KW - Cluster analysis

KW - K-medoids

KW - Partitional clustering

UR - http://www.scopus.com/inward/record.url?scp=84978485240&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84978485240&partnerID=8YFLogxK

U2 - 10.1016/j.procs.2016.05.446

DO - 10.1016/j.procs.2016.05.446

M3 - Conference article

AN - SCOPUS:84978485240

VL - 80

SP - 1159

EP - 1169

JO - Procedia Computer Science

JF - Procedia Computer Science

SN - 1877-0509

ER -