DeepSqueeze: Deep Semantic Compression for Tabular Data

Amir Ilkhechi, Andrew Crotty, Alex Galakatos, Yicong Mao, Grace Fan, Xiran Shi, Ugur Cetintemel

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

23 Scopus citations

Abstract

With the rapid proliferation of large datasets, efficient data compression has become more important than ever. Columnar compression techniques (e.g., dictionary encoding, run-length encoding, delta encoding) have proved highly effective for tabular data, but they typically compress individual columns without considering potential relationships among columns, such as functional dependencies and correlations. Semantic compression techniques, on the other hand, are designed to leverage such relationships to store only a subset of the columns necessary to infer the others, but existing approaches cannot effectively identify complex relationships across more than a few columns at a time. We propose DeepSqueeze, a novel semantic compression framework that can efficiently capture these complex relationships within tabular data by using autoencoders to map tuples to a lower-dimensional representation. DeepSqueeze also supports guaranteed error bounds for lossy compression of numerical data and works in conjunction with common columnar compression formats. Our experimental evaluation uses real-world datasets to demonstrate that DeepSqueeze can achieve over a 4x size reduction compared to state-of-the-art alternatives.
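
The abstract mentions guaranteed error bounds for lossy compression of numerical data but does not say how they are obtained. As a rough illustration only, and not the authors' implementation, one standard way to get such a guarantee is uniform quantization: each value is replaced by the index of a bucket of width 2 * error_bound, and decompression returns the bucket midpoint. The function names (`quantize`, `dequantize`) and the sample column below are hypothetical and do not come from the paper.

```python
import math

def quantize(values, lo, error_bound):
    """Map each value to an integer bucket index.

    Assumption for this sketch: buckets have width 2 * error_bound and
    start at `lo`, the minimum of the column.
    """
    width = 2 * error_bound
    return [math.floor((v - lo) / width) for v in values]

def dequantize(codes, lo, error_bound):
    """Reconstruct each value as the midpoint of its bucket."""
    width = 2 * error_bound
    return [lo + (c + 0.5) * width for c in codes]

# Hypothetical numeric column and error bound.
column = [0.0, 0.26, 3.14159, 9.99, 10.0]
eps = 0.05

codes = quantize(column, min(column), eps)
restored = dequantize(codes, min(column), eps)

# Every reconstructed value lies within eps of the original
# (up to floating-point rounding).
max_error = max(abs(a - b) for a, b in zip(column, restored))
```

The resulting integer codes are small and repetitive, so they could then be handed to a columnar encoding such as dictionary, run-length, or delta encoding. How DeepSqueeze itself combines these steps is described in the paper, not here.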

Original language: English (US)
Title of host publication: SIGMOD 2020 - Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
Publisher: Association for Computing Machinery
Pages: 1733-1746
Number of pages: 14
ISBN (Electronic): 9781450367356
State: Published - Jun 14 2020
Event: 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD 2020 - Portland, United States
Duration: Jun 14 2020 to Jun 19 2020

Publication series

Name: Proceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print): 0730-8078

Conference

Conference: 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD 2020
Country/Territory: United States
City: Portland
Period: 6/14/20 to 6/19/20

Funding

We would like to thank the anonymous reviewers and shepherd for their helpful feedback. This work was funded in part by NSF IIS-1526639 and IIS-1514491.

Keywords

  • data compression
  • semantic compression

ASJC Scopus subject areas

  • Software
  • Information Systems
