Abstract
Health care providers may wish to share limited information with researchers. Manufacturing companies may want to share some but not all data with regulators or partners. Since the emergence of generative adversarial networks (GANs), efforts have been made to generate synthetic data that preserves both semantic properties and statistical distributions. However, all past efforts focus on a single table at a time. We propose FakeDB, a general framework to generate synthetic data that preserves a wide variety of semantic integrity constraints as well as a broad set of statistical properties across an entire relational database. We compare FakeDB with natural extensions of prior work on eight well-known relational databases as well as on a synthetically generated dataset, and show that FakeDB outperforms them. We also show that FakeDB runs in a reasonable amount of time, making it a practical solution to the problem of generating synthetic data.
Original language | English (US) |
---|---|
Pages (from-to) | 5553-5564 |
Number of pages | 12 |
Journal | IEEE Transactions on Dependable and Secure Computing |
Volume | 21 |
Issue number | 6 |
State | Published - 2024 |
Funding
This work was supported by ONR under Grant N00014-18-1-2670 and Grant N00014-20-1-2407.
Keywords
- H.2.0.a security, integrity, and protection < H.2.0 general < H.2 database management < H information technology and systems
- H.2.4.i relational databases < H.2.4 systems < H.2 database management < H information technology and systems
ASJC Scopus subject areas
- General Computer Science
- Electrical and Electronic Engineering