Abstract
In the past decade, academia and industry have embraced machine learning (ML) for database management system (DBMS) automation. These efforts have focused on designing ML models that predict DBMS behavior to support picking actions (e.g., building indexes) that improve the system's performance. Recent developments in ML have created automated methods for finding good models. Such advances shift the bottleneck from DBMS model design to obtaining the training data necessary for building these models. But generating good training data is challenging and requires encoding subject matter expertise into DBMS instrumentation. Existing methods for training data collection are bespoke to individual DBMS components and do not account for (1) how workload trends affect the system and (2) the subtle interactions between internal system components. Consequently, the models created from this data do not support holistic tuning across subsystems and require frequent retraining to boost their accuracy. This paper presents the architecture of a database gym, an integrated environment that provides a unified API of pluggable components for obtaining high-quality training data. The goal of a database gym is to simplify ML model training and evaluation to accelerate autonomous DBMS research. But unlike gyms in other domains that rely on custom simulators, a database gym uses the DBMS itself to create simulation environments for ML training. Thus, we discuss and prescribe methods for overcoming challenges in DBMS simulation, which include demanding requirements for performance, simulation fidelity, and DBMS-generated hints for guiding training processes.
Original language | English (US) |
---|---|
State | Published - 2023 |
Event | 13th Annual Conference on Innovative Data Systems Research, CIDR 2023 - Amsterdam, Netherlands; Duration: Jan 8 2023 → Jan 11 2023 |
Conference
Conference | 13th Annual Conference on Innovative Data Systems Research, CIDR 2023 |
---|---|
Country/Territory | Netherlands |
City | Amsterdam |
Period | 1/8/23 → 1/11/23 |
Funding
This work was supported (in part) by the National Science Foundation (IIS-1846158, SPX-1822933), VMware Research Grants for Databases, Google DAPA Research Grants, and the Alfred P. Sloan Research Fellowship program. TKBM.
ASJC Scopus subject areas
- Information Systems and Management
- Artificial Intelligence
- Information Systems
- Hardware and Architecture