Achieving anonymity via clustering

Gagan Aggarwal*, Tomás Feder, Krishnaram Kenthapadi, Samir Khuller, Rina Panigrahy, Dilys Thomas, An Zhu

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contribution

192 Scopus citations

Abstract

Publishing data for analysis from a table containing personal records, while maintaining individual privacy, is a problem of increasing importance today. The traditional approach of de- identifying records is to remove identifying fields such as social security number, name etc. However, recent research has shown that a large fraction of the US population can be identified using non-key attributes (called quasi-identifiers) such as date of birth, gender, and zip code [15]. Sweeney [16] proposed the k-anonymity model for privacy where non-key attributes that leak information are suppressed or generalized so that, for every record in the modified table, there are at least k-1 other records having exactly the same values for quasi-identifiers. We propose a new method for anonymizing data records, where quasi-identifiers of data records are first clustered and then cluster centers are published. To ensure privacy of the data records, we impose the constraint that each cluster must contain no fewer than a pre-specified number of data records. This technique is more general since we have a much larger choice for cluster centers than k-Anonymity. In many cases, it lets us release a lot more information without compromising privacy. We also provide constant-factor approximation algorithms to come up with such a clustering. This is the first set of algorithms for the anonymization problem where the performance is independent of the anonymity parameter k. We further observe that a few outlier points can significantly increase the cost of anonymization. Hence, we extend our algorithms to allow an ε fraction of points to remain unclustered, i.e., deleted from the anonymized publication. Thus, by not releasing a small fraction of the database records, we can ensure that the data published for analysis has less distortion and hence is more useful. Our approximation algorithms for new clustering objectives are of independent interest and could be applicable in other clustering scenarios as well.

Original languageEnglish (US)
Title of host publicationProceedings of the Twenty-Fifth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2006
Pages153-162
Number of pages10
DOIs
StatePublished - Dec 1 2006
Event25th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2006 - Chicago, IL, United States
Duration: Jun 26 2006Jun 28 2006

Publication series

NameProceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems

Conference

Conference25th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2006
CountryUnited States
CityChicago, IL
Period6/26/066/28/06

Keywords

  • Anonymity
  • Approximation algorithms
  • Clustering
  • Privacy

ASJC Scopus subject areas

  • Software
  • Information Systems
  • Hardware and Architecture

Fingerprint Dive into the research topics of 'Achieving anonymity via clustering'. Together they form a unique fingerprint.

Cite this