FakeDB: Generating Fake Synthetic Databases

Chongyang Gao, Sushil Jajodia, Andrea Pugliese, V. S. Subrahmanian*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Health care providers may wish to share limited information with researchers. Manufacturing companies may want to share some but not all data with regulators or partners. Since the emergence of generative adversarial networks (GANs), efforts have been made to generate synthetic data that preserves both semantic properties and statistical distributions. However, all past efforts focus on a single table at a time. We propose FakeDB, a general framework to generate synthetic data that preserves a wide variety of semantic integrity constraints as well as a broad set of statistical properties across an entire relational database. We compare FakeDB with natural extensions of prior work on eight well-known relational databases as well as on a synthetically generated dataset, and show that FakeDB outperforms them. We also show that FakeDB runs in reasonable amounts of time, making it a practical solution to the problem of generating synthetic data.

Original language: English (US)
Pages (from-to): 5553-5564
Number of pages: 12
Journal: IEEE Transactions on Dependable and Secure Computing
Volume: 21
Issue number: 6
State: Published - 2024

Funding

This work was supported by ONR under Grant N00014-18-1-2670 and Grant N00014-20-1-2407.

Keywords

  • H.2.0.a security, integrity, and protection < H.2.0 general < H.2 database management < H information technology and systems
  • H.2.4.i relational databases < H.2.4 systems < H.2 database management < H information technology and systems

ASJC Scopus subject areas

  • General Computer Science
  • Electrical and Electronic Engineering
