Sampling informative training data for RNN language models

Jared Fernandez*, Douglas C. Downey

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We propose an unsupervised importance sampling approach to selecting training data for recurrent neural network (RNN) language models. To increase the information content of the training set, our approach preferentially samples high perplexity sentences, as determined by an easily queryable n-gram language model. We experimentally evaluate the heldout perplexity of models trained with our various importance sampling distributions. We show that language models trained on data sampled using our proposed approach outperform models trained over randomly sampled subsets of both the Billion Word (Chelba et al., 2014) and Wikitext-103 benchmark corpora (Merity et al., 2016).
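The abstract's core idea is to score candidate sentences with a cheap n-gram language model and then sample training data with probability weighted toward high-perplexity sentences. The paper's actual n-gram model and importance sampling distributions are not given here, so the following is only a minimal illustrative sketch of that idea; the `BigramLM` class and `sample_training_subset` helper are hypothetical names introduced for illustration, not the authors' implementation.

```python
import math
import random
from collections import defaultdict

class BigramLM:
    """Toy add-one-smoothed bigram LM standing in for an 'easily queryable' n-gram model."""

    def __init__(self, corpus):
        self.unigrams = defaultdict(int)
        self.bigrams = defaultdict(int)
        for sent in corpus:
            tokens = ["<s>"] + sent.split() + ["</s>"]
            for w in tokens:
                self.unigrams[w] += 1
            for a, b in zip(tokens, tokens[1:]):
                self.bigrams[(a, b)] += 1
        self.vocab_size = len(self.unigrams)

    def perplexity(self, sent):
        tokens = ["<s>"] + sent.split() + ["</s>"]
        log_prob = 0.0
        for a, b in zip(tokens, tokens[1:]):
            # Add-one smoothed conditional probability P(b | a).
            p = (self.bigrams[(a, b)] + 1) / (self.unigrams[a] + self.vocab_size)
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(tokens) - 1))

def sample_training_subset(corpus, lm, k, seed=0):
    """Sample k sentences with probability proportional to n-gram perplexity.

    This weighting is one plausible choice; the paper evaluates several
    importance sampling distributions, which are not reproduced here.
    """
    weights = [lm.perplexity(s) for s in corpus]
    random.seed(seed)
    return random.choices(corpus, weights=weights, k=k)

if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "the cat sat on the mat",
        "colorless green ideas sleep furiously",
        "stock prices fell sharply in afternoon trading",
    ]
    lm = BigramLM(corpus)
    print(sample_training_subset(corpus, lm, k=2))
```

In this sketch, frequently seen sentences receive low perplexity under the bigram model and are therefore sampled less often, while rarer, more "informative" sentences are preferentially selected for the RNN training set.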

Original language: English (US)
Title of host publication: ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Student Research Workshop
Publisher: Association for Computational Linguistics (ACL)
Pages: 9-13
Number of pages: 5
ISBN (Electronic): 9781948087360
State: Published - Jan 1 2018
Event: 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018 - Melbourne, Australia
Duration: Jul 15 2018 - Jul 20 2018

Publication series

Name: ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Student Research Workshop

Conference

Conference: 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018
Country: Australia
City: Melbourne
Period: 7/15/18 - 7/20/18

Fingerprint

Recurrent neural networks
Sampling
Importance sampling

ASJC Scopus subject areas

  • Software
  • Computational Theory and Mathematics

Cite this

Fernandez, J., & Downey, D. C. (2018). Sampling informative training data for RNN language models. In ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Student Research Workshop (pp. 9-13). Association for Computational Linguistics (ACL).