TY - GEN
T1 - Sampling informative training data for RNN language models
AU - Fernandez, Jared
AU - Downey, Douglas C
N1 - Publisher Copyright:
© 2018 Association for Computational Linguistics.
PY - 2018
Y1 - 2018
N2 - We propose an unsupervised importance sampling approach to selecting training data for recurrent neural network (RNN) language models. To increase the information content of the training set, our approach preferentially samples high perplexity sentences, as determined by an easily queryable n-gram language model. We experimentally evaluate the heldout perplexity of models trained with our various importance sampling distributions. We show that language models trained on data sampled using our proposed approach outperform models trained over randomly sampled subsets of both the Billion Word (Chelba et al., 2014) and Wikitext-103 benchmark corpora (Merity et al., 2016).
AB - We propose an unsupervised importance sampling approach to selecting training data for recurrent neural network (RNN) language models. To increase the information content of the training set, our approach preferentially samples high perplexity sentences, as determined by an easily queryable n-gram language model. We experimentally evaluate the heldout perplexity of models trained with our various importance sampling distributions. We show that language models trained on data sampled using our proposed approach outperform models trained over randomly sampled subsets of both the Billion Word (Chelba et al., 2014) and Wikitext-103 benchmark corpora (Merity et al., 2016).
UR - http://www.scopus.com/inward/record.url?scp=85063074407&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85063074407&partnerID=8YFLogxK
U2 - 10.18653/v1/p18-3002
DO - 10.18653/v1/p18-3002
M3 - Conference contribution
AN - SCOPUS:85063074407
T3 - ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Student Research Workshop
SP - 9
EP - 13
BT - ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Student Research Workshop
PB - Association for Computational Linguistics (ACL)
T2 - 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018
Y2 - 15 July 2018 through 20 July 2018
ER -