TY - GEN
T1 - DBPal
T2 - 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD 2020
AU - Weir, Nathaniel
AU - Utama, Prasetya
AU - Galakatos, Alex
AU - Crotty, Andrew
AU - Ilkhechi, Amir
AU - Ramaswamy, Shekar
AU - Bhushan, Rohin
AU - Geisler, Nadja
AU - Hättasch, Benjamin
AU - Eger, Steffen
AU - Cetintemel, Ugur
AU - Binnig, Carsten
N1 - Funding Information:
This work was funded in part by NSF grants III:1526639 and III:1514491, as well as gifts from Oracle to support our work on Natural Language Interfaces on Big Data.
Publisher Copyright:
© 2020 Association for Computing Machinery.
PY - 2020/6/14
Y1 - 2020/6/14
N2 - Natural language is a promising alternative interface to DBMSs because it enables non-technical users to formulate complex questions in a more concise manner than SQL. Recently, deep learning has gained traction for translating natural language to SQL, since similar ideas have been successful in the related domain of machine translation. However, the core problem with existing deep learning approaches is that they require an enormous amount of training data in order to provide accurate translations. This training data is extremely expensive to curate, since it generally requires humans to manually annotate natural language examples with the corresponding SQL queries (or vice versa). Based on these observations, we propose DBPal, a new approach that augments existing deep learning techniques in order to improve the performance of models for natural language to SQL translation. More specifically, we present a novel training pipeline that automatically generates synthetic training data in order to (1) improve overall translation accuracy, (2) increase robustness to linguistic variation, and (3) specialize the model for the target database. As we show, our DBPal training pipeline is able to improve both the accuracy and linguistic robustness of state-of-the-art natural language to SQL translation models.
KW - NL2SQL
KW - NLIDB
KW - natural language interface to database
KW - natural language to SQL
UR - http://www.scopus.com/inward/record.url?scp=85086267160&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85086267160&partnerID=8YFLogxK
U2 - 10.1145/3318464.3380589
DO - 10.1145/3318464.3380589
M3 - Conference contribution
AN - SCOPUS:85086267160
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 2347
EP - 2361
BT - SIGMOD 2020 - Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
PB - Association for Computing Machinery
Y2 - 14 June 2020 through 19 June 2020
ER -