TY - GEN
T1 - To Store or Not to Store
T2 - 38th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2024
AU - Guo, Anxin
AU - Li, Jingwei
AU - Sukprasert, Pattara
AU - Khuller, Samir
AU - Deshpande, Amol
AU - Mukherjee, Koyel
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Dataset Versioning is extremely important for ensuring the reproducibility of results, tracking data changes over time, maintaining quality measures, enabling collaboration, and ensuring legal compliance. In this work, we study the cost efficient data versioning problem, where the goal is to optimize the storage and reconstruction (retrieval) costs of data versions, given a graph of datasets as nodes and edges capturing edit/delta information. One central variant we study is MINSUM RETRIEVAL (MSR) where the goal is to minimize the total retrieval costs, while keeping the storage costs bounded. This problem (along with its variants) was introduced by Bhattacherjee et al. [VLDB'15]. While such problems are frequently encountered in collaborative tools (e.g., version control systems and data analysis pipelines), to the best of our knowledge, no existing research studies the theoretical aspects of these problems.We established, in the full version of this work1, that the previous best heuristic, LMG (introduced in [VLDB'15]) can perform arbitrarily badly in a simple worst case. Moreover, we show that it is hard to get o(n)-approximation for MSR on general graphs even if we relax the storage constraints by an O(log n) factor. Similar hardness results are shown for other variants. Meanwhile, we propose poly-time approximation schemes for tree-like graphs, motivated by the fact that the graphs arising in practice from typical edit operations are often not arbitrary. As version graphs typically have low treewidth, we further develop new algorithms for bounded treewidth graphs.Furthermore, we propose two new heuristics and evaluate them empirically. First, we extend LMG by considering more potential "moves", to propose a new heuristic LMG-All. LMG-All consistently outperforms LMG while having comparable run time on a wide variety of datasets, i.e., version graphs. Secondly, we apply our tree algorithms on the minimum-storage arborescence of an instance, yielding algorithms that are qualitatively better than all previous heuristics for MSR, as well as for another variant BOUNDEDMIN RETRIEVAL (BMR).
AB - Dataset Versioning is extremely important for ensuring the reproducibility of results, tracking data changes over time, maintaining quality measures, enabling collaboration, and ensuring legal compliance. In this work, we study the cost efficient data versioning problem, where the goal is to optimize the storage and reconstruction (retrieval) costs of data versions, given a graph of datasets as nodes and edges capturing edit/delta information. One central variant we study is MINSUM RETRIEVAL (MSR) where the goal is to minimize the total retrieval costs, while keeping the storage costs bounded. This problem (along with its variants) was introduced by Bhattacherjee et al. [VLDB'15]. While such problems are frequently encountered in collaborative tools (e.g., version control systems and data analysis pipelines), to the best of our knowledge, no existing research studies the theoretical aspects of these problems.We established, in the full version of this work1, that the previous best heuristic, LMG (introduced in [VLDB'15]) can perform arbitrarily badly in a simple worst case. Moreover, we show that it is hard to get o(n)-approximation for MSR on general graphs even if we relax the storage constraints by an O(log n) factor. Similar hardness results are shown for other variants. Meanwhile, we propose poly-time approximation schemes for tree-like graphs, motivated by the fact that the graphs arising in practice from typical edit operations are often not arbitrary. As version graphs typically have low treewidth, we further develop new algorithms for bounded treewidth graphs.Furthermore, we propose two new heuristics and evaluate them empirically. First, we extend LMG by considering more potential "moves", to propose a new heuristic LMG-All. LMG-All consistently outperforms LMG while having comparable run time on a wide variety of datasets, i.e., version graphs. Secondly, we apply our tree algorithms on the minimum-storage arborescence of an instance, yielding algorithms that are qualitatively better than all previous heuristics for MSR, as well as for another variant BOUNDEDMIN RETRIEVAL (BMR).
KW - Approximation Algorithm
KW - Combinatorial Optimization
KW - Data Management
KW - Version Control Systems
UR - http://www.scopus.com/inward/record.url?scp=85198904192&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85198904192&partnerID=8YFLogxK
U2 - 10.1109/IPDPS57955.2024.00049
DO - 10.1109/IPDPS57955.2024.00049
M3 - Conference contribution
AN - SCOPUS:85198904192
T3 - Proceedings - 2024 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2024
SP - 479
EP - 493
BT - Proceedings - 2024 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2024
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 27 May 2024 through 31 May 2024
ER -