To Store or Not to Store: a graph theoretical approach for Dataset Versioning

Anxin Guo*, Jingwei Li, Pattara Sukprasert, Samir Khuller, Amol Deshpande, Koyel Mukherjee

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Dataset Versioning is extremely important for ensuring the reproducibility of results, tracking data changes over time, maintaining quality measures, enabling collaboration, and ensuring legal compliance. In this work, we study the cost efficient data versioning problem, where the goal is to optimize the storage and reconstruction (retrieval) costs of data versions, given a graph of datasets as nodes and edges capturing edit/delta information. One central variant we study is MINSUM RETRIEVAL (MSR) where the goal is to minimize the total retrieval costs, while keeping the storage costs bounded. This problem (along with its variants) was introduced by Bhattacherjee et al. [VLDB'15]. While such problems are frequently encountered in collaborative tools (e.g., version control systems and data analysis pipelines), to the best of our knowledge, no existing research studies the theoretical aspects of these problems.We established, in the full version of this work1, that the previous best heuristic, LMG (introduced in [VLDB'15]) can perform arbitrarily badly in a simple worst case. Moreover, we show that it is hard to get o(n)-approximation for MSR on general graphs even if we relax the storage constraints by an O(log n) factor. Similar hardness results are shown for other variants. Meanwhile, we propose poly-time approximation schemes for tree-like graphs, motivated by the fact that the graphs arising in practice from typical edit operations are often not arbitrary. As version graphs typically have low treewidth, we further develop new algorithms for bounded treewidth graphs.Furthermore, we propose two new heuristics and evaluate them empirically. First, we extend LMG by considering more potential "moves", to propose a new heuristic LMG-All. LMG-All consistently outperforms LMG while having comparable run time on a wide variety of datasets, i.e., version graphs. Secondly, we apply our tree algorithms on the minimum-storage arborescence of an instance, yielding algorithms that are qualitatively better than all previous heuristics for MSR, as well as for another variant BOUNDEDMIN RETRIEVAL (BMR).

Original languageEnglish (US)
Title of host publicationProceedings - 2024 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2024
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages479-493
Number of pages15
ISBN (Electronic)9798350337662
DOIs
StatePublished - 2024
Event38th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2024 - San Francisco, United States
Duration: May 27 2024May 31 2024

Publication series

NameProceedings - 2024 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2024

Conference

Conference38th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2024
Country/TerritoryUnited States
CitySan Francisco
Period5/27/245/31/24

Keywords

  • Approximation Algorithm
  • Combinatorial Optimization
  • Data Management
  • Version Control Systems

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Networks and Communications
  • Computer Science Applications
  • Hardware and Architecture

Fingerprint

Dive into the research topics of 'To Store or Not to Store: a graph theoretical approach for Dataset Versioning'. Together they form a unique fingerprint.

Cite this