Database Gyms

Wan Shen Lim, Andrew Crotty, Matthew Butrovich, Lin Ma, William Zhang, Peijing Xu, Johannes Gehrke, Andrew Pavlo

Research output: Contribution to conference (paper, peer-reviewed)

Abstract

In the past decade, academia and industry have embraced machine learning (ML) for database management system (DBMS) automation. These efforts have focused on designing ML models that predict DBMS behavior to support picking actions (e.g., building indexes) that improve the system's performance. Recent developments in ML have created automated methods for finding good models. Such advances shift the bottleneck from DBMS model design to obtaining the training data necessary for building these models. But generating good training data is challenging and requires encoding subject matter expertise into DBMS instrumentation. Existing methods for training data collection are bespoke to individual DBMS components and do not account for (1) how workload trends affect the system and (2) the subtle interactions between internal system components. Consequently, the models created from this data do not support holistic tuning across subsystems and require frequent retraining to boost their accuracy. This paper presents the architecture of a database gym, an integrated environment that provides a unified API of pluggable components for obtaining high-quality training data. The goal of a database gym is to simplify ML model training and evaluation to accelerate autonomous DBMS research. But unlike gyms in other domains that rely on custom simulators, a database gym uses the DBMS itself to create simulation environments for ML training. Thus, we discuss and prescribe methods for overcoming challenges in DBMS simulation, which include demanding requirements for performance, simulation fidelity, and DBMS-generated hints for guiding training processes.

Original language: English (US)
State: Published - 2023
Event: 13th Annual Conference on Innovative Data Systems Research, CIDR 2023 - Amsterdam, Netherlands
Duration: Jan 8 2023 to Jan 11 2023

Conference

Conference: 13th Annual Conference on Innovative Data Systems Research, CIDR 2023
Country/Territory: Netherlands
City: Amsterdam
Period: 1/8/23 to 1/11/23

Funding

This work was supported (in part) by the National Science Foundation (IIS-1846158, SPX-1822933), VMware Research Grants for Databases, Google DAPA Research Grants, and the Alfred P. Sloan Research Fellowship program. TKBM.

ASJC Scopus subject areas

  • Information Systems and Management
  • Artificial Intelligence
  • Information Systems
  • Hardware and Architecture
