A moment matching approach for generating synthetic data

Brittany Megan Bogle*, Sanjay Mehrotra

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Scopus citations

Abstract

Synthetic data are becoming an increasingly important mechanism for sharing data among collaborators and with the public. Multiple methods for generating synthetic data have been proposed, but many have shortcomings with respect to maintaining the statistical properties of the original data. We propose a new method for fully synthetic data generation that leverages linear and integer mathematical programming models to match the moments of the original data in the synthetic data. This method has no inherent disclosure risk and does not require parametric or distributional assumptions. We demonstrate this methodology using the Framingham Heart Study. Existing synthetic data methods that use chained equations were compared with our approach. We fit Cox proportional hazards, logistic regression, and nonparametric models to synthetic data and compared them with models fitted to the original data. True coverage, the proportion of synthetic data parameter confidence intervals that include the original data's parameter estimate, was 100% for parametric models when up to four moments were matched, consistently exceeding that of the chained equations approach. The area under the curve and accuracy of the nonparametric models trained on synthetic data differed marginally when tested on the full original data. Models were also trained on synthetic data and on a partition of the original data, and were tested on a held-out portion of the original data. Fourth-order moment matched synthetic data outperformed other methods with respect to fitted parametric models but did not always outperform them with fitted nonparametric models. No single synthetic data method consistently outperformed the others when assessing the performance of nonparametric models. The performance of fourth-order moment matched synthetic data in fitting parametric models suggests its use in these cases.
Our empirical results also suggest that the performance of synthetic data generation techniques, including the moment matching approach, is less stable for use with nonparametric models. The benefits of the moment matching approach should be weighed against additional computational costs. In summary, our results demonstrate that the introduced moment matching approach may be considered as an alternative to existing synthetic data generation methods.
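
The true coverage metric defined in the abstract can be illustrated with a minimal sketch. It assumes one confidence interval per fitted parameter from the synthetic data, paired with the corresponding estimate from the original data. The function name `true_coverage` and the numbers are hypothetical and are not taken from the paper; how the authors pool intervals across synthetic datasets is not stated here.

```python
from typing import Sequence, Tuple


def true_coverage(
    original_estimates: Sequence[float],
    synthetic_intervals: Sequence[Tuple[float, float]],
) -> float:
    """Proportion of synthetic-data confidence intervals that contain
    the matching original-data parameter estimate.

    Assumption: interval i belongs to the same parameter as estimate i.
    """
    if len(original_estimates) != len(synthetic_intervals):
        raise ValueError("need one interval per original estimate")
    if not original_estimates:
        raise ValueError("need at least one parameter")
    hits = sum(
        1
        for estimate, (lower, upper) in zip(original_estimates, synthetic_intervals)
        if lower <= estimate <= upper
    )
    return hits / len(original_estimates)


# Hypothetical values for illustration only (not results from the study).
estimates = [0.30, -1.20, 2.00, 0.00]
intervals = [(0.10, 0.50), (-1.50, -0.90), (2.10, 2.80), (-0.20, 0.20)]
coverage = true_coverage(estimates, intervals)
```

In this made-up example, three of the four intervals contain their original estimate, so the coverage is 0.75. A value of 1.0 would correspond to the 100% true coverage reported for parametric models.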

Original language: English (US)
Pages (from-to): 160-178
Number of pages: 19
Journal: Big Data
Volume: 4
Issue number: 3
State: Published - Sep 1 2016

Keywords

  • fully synthetic data
  • mathematical programming
  • moment matching
  • optimization
  • synthetic data

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Information Systems and Management
