Generative data augmentation for commonsense reasoning

Yiben Yang, Chaitanya Malaviya, Jared Fernandez, Swabha Swayamdipta, Ronan Le Bras, Ji Ping Wang, Chandra Bhagavatula, Yejin Choi, Doug Downey

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

95 Scopus citations

Abstract

Recent advances in commonsense reasoning depend on large-scale human-annotated training sets to achieve peak performance. However, manual curation of training sets is expensive and has been shown to introduce annotation artifacts that neural models can readily exploit and overfit to. We propose a novel generative data augmentation technique, G-DAUGc, that aims to achieve more accurate and robust learning in a low-resource setting. Our approach generates synthetic examples using pretrained language models, and selects the most informative and diverse set of examples for data augmentation. In experiments on multiple commonsense reasoning benchmarks, G-DAUGc consistently outperforms existing data augmentation methods based on back-translation, establishing a new state-of-the-art on WINOGRANDE, CODAH, and COMMONSENSEQA. It also enhances out-of-distribution generalization, showing greater robustness to adversarial and perturbed examples. Our analysis demonstrates that G-DAUGc produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance.
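
The abstract describes a two-step recipe: sample synthetic examples from a pretrained language model, then keep an informative and diverse subset for augmentation. The sketch below illustrates that recipe under stated assumptions: it uses the Hugging Face transformers library with GPT-2, and the prompt, sampling parameters, and greedy lexical-diversity heuristic are illustrative stand-ins rather than the paper's actual generation setup or influence- and diversity-based selection.

```python
# Minimal sketch of generative data augmentation (not the authors' exact
# G-DAUGc pipeline): sample candidate examples from a pretrained LM, then
# keep a diverse subset. Model choice and selection heuristic are assumptions.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def generate_candidates(prompt, num_candidates=20, max_new_tokens=40):
    """Sample candidate synthetic examples conditioned on a task-style prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,                       # sampling yields varied candidates
        top_p=0.95,
        max_new_tokens=max_new_tokens,
        num_return_sequences=num_candidates,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
    )
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in outputs]

def select_diverse(candidates, k=5):
    """Greedily keep candidates that add new vocabulary: a toy stand-in for
    the paper's informativeness- and diversity-based selection."""
    selected, covered = [], set()
    for text in sorted(candidates, key=len, reverse=True):
        tokens = set(text.lower().split())
        if tokens and len(tokens - covered) / len(tokens) > 0.5:
            selected.append(text)
            covered |= tokens
        if len(selected) == k:
            break
    return selected

# Hypothetical usage: generate commonsense-style completions, keep a diverse subset.
synthetic = select_diverse(generate_candidates("Q: Why do people wear coats in winter?\nA:"))
```

In G-DAUGc itself, the selected synthetic examples augment the original low-resource training set before the task model is fine-tuned.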

Original language: English (US)
Title of host publication: Findings of the Association for Computational Linguistics, Findings of ACL
Subtitle of host publication: EMNLP 2020
Publisher: Association for Computational Linguistics (ACL)
Pages: 1008-1025
Number of pages: 18
ISBN (Electronic): 9781952148903
State: Published - 2020
Event: Findings of the Association for Computational Linguistics, ACL 2020: EMNLP 2020 - Virtual, Online
Duration: Nov 16, 2020 - Nov 20, 2020

Publication series

Name: Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2020

Conference

Conference: Findings of the Association for Computational Linguistics, ACL 2020: EMNLP 2020
City: Virtual, Online
Period: 11/16/20 - 11/20/20

Funding

This work was supported in part by NSF Grant IIS-1351029. We thank Iz Beltagy, Jonathan Bragg, Isabel Cachola, Arman Cohan, Mike D’Arcy, Daniel King, Kyle Lo, and Lucy Lu Wang for helpful comments.

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Computational Theory and Mathematics
