TY - GEN
T1 - Don't Stop Pretraining: Adapt Language Models to Domains and Tasks
T2 - 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020
AU - Gururangan, Suchin
AU - Marasović, Ana
AU - Swayamdipta, Swabha
AU - Lo, Kyle
AU - Beltagy, Iz
AU - Downey, Doug
AU - Smith, Noah A.
N1 - Funding Information:
The authors thank Dallas Card, Mark Neumann, Nelson Liu, Eric Wallace, members of the AllenNLP team, and anonymous reviewers for helpful feedback, and Arman Cohan for providing data. This research was supported in part by the Office of Naval Research under the MURI grant N00014-18-1-2670.
Publisher Copyright:
© 2020 Association for Computational Linguistics
PY - 2020
Y1 - 2020
N2 - Language models pretrained on text from a wide variety of sources form the foundation of today's NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task's unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multiphase adaptive pretraining offers large gains in task performance.
AB - Language models pretrained on text from a wide variety of sources form the foundation of today's NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task's unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multiphase adaptive pretraining offers large gains in task performance.
UR - http://www.scopus.com/inward/record.url?scp=85117904928&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85117904928&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85117904928
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 8342
EP - 8360
BT - ACL 2020 - 58th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
PB - Association for Computational Linguistics (ACL)
Y2 - 5 July 2020 through 10 July 2020
ER -