On constructing an optimal consensus clustering from multiple clusterings

Piotr Berman, Bhaskar DasGupta*, Ming-Yang Kao, Jie Wang

*Corresponding author for this work

Research output: Contribution to journalArticle

9 Citations (Scopus)

Abstract

Computing a suitable measure of consensus among several clusterings on the same data is an important problem that arises in several areas such as computational biology and data mining. In this paper, we formalize a set-theoretic model for computing such a similarity measure. Roughly speaking, in this model we have k > 1 partitions (clusters) of the same data set each containing the same number of sets and the goal is to align the sets in each partition to minimize a similarity measure. For k = 2, a polynomial-time solution was proposed by Gusfield (Information Processing Letters 82 (2002) 159-164). In this paper, we show that the problem is MAX-SNP-hard for k = 3 even if each partition in each cluster contains no more than 2 elements and provide a 2 - frac(2, k)-approximation algorithm for the problem for any k.

Original languageEnglish (US)
Pages (from-to)137-145
Number of pages9
JournalInformation Processing Letters
Volume104
Issue number4
DOIs
StatePublished - Nov 15 2007

Fingerprint

Clustering
Partition
Approximation algorithms
Similarity Measure
Data mining
Polynomials
Computing
Computational Biology
Information Processing
Approximation Algorithms
Data Mining
Polynomial time
Minimise
Model

Keywords

  • Approximation algorithms
  • Computational complexity
  • Consensus clustering

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Signal Processing
  • Information Systems
  • Computer Science Applications

Cite this

Berman, Piotr ; DasGupta, Bhaskar ; Kao, Ming-Yang ; Wang, Jie. / On constructing an optimal consensus clustering from multiple clusterings. In: Information Processing Letters. 2007 ; Vol. 104, No. 4. pp. 137-145.
@article{0e49c96227654a8b949e32ee5e82de82,
title = "On constructing an optimal consensus clustering from multiple clusterings",
abstract = "Computing a suitable measure of consensus among several clusterings on the same data is an important problem that arises in several areas such as computational biology and data mining. In this paper, we formalize a set-theoretic model for computing such a similarity measure. Roughly speaking, in this model we have k > 1 partitions (clusters) of the same data set each containing the same number of sets and the goal is to align the sets in each partition to minimize a similarity measure. For k = 2, a polynomial-time solution was proposed by Gusfield (Information Processing Letters 82 (2002) 159-164). In this paper, we show that the problem is MAX-SNP-hard for k = 3 even if each partition in each cluster contains no more than 2 elements and provide a 2 - frac(2, k)-approximation algorithm for the problem for any k.",
keywords = "Approximation algorithms, Computational complexity, Consensus clustering",
author = "Piotr Berman and Bhaskar DasGupta and Ming-Yang Kao and Jie Wang",
year = "2007",
month = "11",
day = "15",
doi = "10.1016/j.ipl.2007.06.008",
language = "English (US)",
volume = "104",
pages = "137--145",
journal = "Information Processing Letters",
issn = "0020-0190",
publisher = "Elsevier",
number = "4",

}

On constructing an optimal consensus clustering from multiple clusterings. / Berman, Piotr; DasGupta, Bhaskar; Kao, Ming-Yang; Wang, Jie.

In: Information Processing Letters, Vol. 104, No. 4, 15.11.2007, p. 137-145.

Research output: Contribution to journalArticle

TY - JOUR

T1 - On constructing an optimal consensus clustering from multiple clusterings

AU - Berman, Piotr

AU - DasGupta, Bhaskar

AU - Kao, Ming-Yang

AU - Wang, Jie

PY - 2007/11/15

Y1 - 2007/11/15

N2 - Computing a suitable measure of consensus among several clusterings on the same data is an important problem that arises in several areas such as computational biology and data mining. In this paper, we formalize a set-theoretic model for computing such a similarity measure. Roughly speaking, in this model we have k > 1 partitions (clusters) of the same data set each containing the same number of sets and the goal is to align the sets in each partition to minimize a similarity measure. For k = 2, a polynomial-time solution was proposed by Gusfield (Information Processing Letters 82 (2002) 159-164). In this paper, we show that the problem is MAX-SNP-hard for k = 3 even if each partition in each cluster contains no more than 2 elements and provide a 2 - frac(2, k)-approximation algorithm for the problem for any k.

AB - Computing a suitable measure of consensus among several clusterings on the same data is an important problem that arises in several areas such as computational biology and data mining. In this paper, we formalize a set-theoretic model for computing such a similarity measure. Roughly speaking, in this model we have k > 1 partitions (clusters) of the same data set each containing the same number of sets and the goal is to align the sets in each partition to minimize a similarity measure. For k = 2, a polynomial-time solution was proposed by Gusfield (Information Processing Letters 82 (2002) 159-164). In this paper, we show that the problem is MAX-SNP-hard for k = 3 even if each partition in each cluster contains no more than 2 elements and provide a 2 - frac(2, k)-approximation algorithm for the problem for any k.

KW - Approximation algorithms

KW - Computational complexity

KW - Consensus clustering

UR - http://www.scopus.com/inward/record.url?scp=34548032601&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=34548032601&partnerID=8YFLogxK

U2 - 10.1016/j.ipl.2007.06.008

DO - 10.1016/j.ipl.2007.06.008

M3 - Article

AN - SCOPUS:34548032601

VL - 104

SP - 137

EP - 145

JO - Information Processing Letters

JF - Information Processing Letters

SN - 0020-0190

IS - 4

ER -