Wan Shen Lim

Database Gyms: Towards Autonomous Database Tuning

Abstract

Database management systems (DBMSs) are the foundation of modern data-intensive applications. But as DBMSs gain features to support new workloads, they become increasingly complex and difficult to configure. Thus, researchers have invested decades of effort into autonomous DBMS configuration. Recent advances in machine learning (ML) have produced tools that outperform unassisted experts in real-world deployments. However, these tools are advisory and require human expertise to deploy into database tuning pipelines. Using these tools involves a multi-step process where a human operator (1) determines an optimization objective, (2) selects a suitable tool to improve the objective, (3) sets up and configures the DBMS to run a particular workload, (4) runs the workload to collect telemetry, (5) uses the collected telemetry to calibrate the tool, and (6) operates the tool to obtain recommendations, which the operator must then review and apply. Because these pipelines are ad hoc, they require significant human effort to set up, extend, and deploy. Moreover, the tools are difficult to compose and swap: even if two tools are designed for the same task, differences in their interfaces prevent one from being substituted for the other. As a result, despite the demonstrated ability of database tuning tools to improve performance and lower costs, their adoption remains limited.

This dissertation presents the database gym, an integrated framework that systematizes and automates the DBMS configuration pipeline. Unlike prior research that focused on improving tool effectiveness with ML, the gym addresses challenges in the deployment and operation of existing tools by providing a set of reusable, interoperable, and interchangeable components that simplify their development and integration. The gym is motivated by the observation that the bottleneck in database tuning has shifted from developing better algorithms for tools to acquiring the training data needed to operate them.
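To make the idea of interchangeable components concrete, the following is a minimal sketch of how two tuning tools could share one interface so that the same pipeline drives either of them. It is not the gym's actual API: the `TuningTool` protocol, both toy tools, the `work_mem_mb` knob, the 100 ms threshold, and the `fake_execute` stand-in for a real DBMS are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol

Telemetry = List[Dict[str, float]]  # one record per executed query
Config = Dict[str, float]           # knob name -> recommended value


class TuningTool(Protocol):
    """Hypothetical common interface; not the gym's actual API."""

    def calibrate(self, telemetry: Telemetry) -> None: ...

    def recommend(self) -> Config: ...


@dataclass
class MeanLatencyTool:
    """Toy tool: picks a made-up 'work_mem_mb' value from the mean latency."""

    mean_ms: float = 0.0

    def calibrate(self, telemetry: Telemetry) -> None:
        self.mean_ms = sum(r["latency_ms"] for r in telemetry) / len(telemetry)

    def recommend(self) -> Config:
        return {"work_mem_mb": 64.0 if self.mean_ms > 100.0 else 16.0}


@dataclass
class MaxLatencyTool:
    """Toy tool: same knob, but driven by the slowest query instead."""

    max_ms: float = 0.0

    def calibrate(self, telemetry: Telemetry) -> None:
        self.max_ms = max(r["latency_ms"] for r in telemetry)

    def recommend(self) -> Config:
        return {"work_mem_mb": 64.0 if self.max_ms > 100.0 else 16.0}


def run_pipeline(tool: TuningTool, workload: List[str],
                 execute: Callable[[str], float]) -> Config:
    """Collect telemetry, calibrate the tool, and return its recommendation."""
    telemetry = [{"latency_ms": execute(query)} for query in workload]
    tool.calibrate(telemetry)
    return tool.recommend()


def fake_execute(query: str) -> float:
    """Stand-in for running a query on a DBMS: latency grows with query length."""
    return 10.0 * len(query)


workload = ["a" * 5, "a" * 12]  # fake latencies: 50 ms and 120 ms
mean_config = run_pipeline(MeanLatencyTool(), workload, fake_execute)
max_config = run_pipeline(MaxLatencyTool(), workload, fake_execute)
```

Because `run_pipeline` depends only on the shared interface, swapping one tool for the other changes a single argument and leaves the workload setup and telemetry collection untouched.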

In this dissertation, we demonstrate how the gym's architecture accelerates and adapts tool-based database tuning pipelines through the systematic generation and utilization of their training data, enabling the augmentation and orchestration of tools with end-to-end knowledge. The gym leverages its complete control over the tuning process to enable holistic optimizations that span the entire pipeline. For instance, it reduces step-level overhead by skipping redundant computation during telemetry collection, thus reducing the tuning pipeline's latency. It also eliminates pipeline-level repetition by reusing past experience to adapt a tool's calibrated models to new environments, accounting for semantic differences across software versions and hardware environments.
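One simple way to picture skipping redundant computation during telemetry collection is memoization: if a query has already been measured under an identical configuration, the stored measurement is reused instead of running the query again. The sketch below illustrates only that general idea and is not the dissertation's technique; the `TelemetryCache` class, the `fake_execute` function, and the `work_mem_mb` knob are hypothetical.

```python
from typing import Callable, Dict, Tuple

ConfigKey = Tuple[Tuple[str, float], ...]  # hashable, order-independent form of a config


class TelemetryCache:
    """Hypothetical cache that reuses measurements for repeated (query, config) pairs."""

    def __init__(self, execute: Callable[[str, Dict[str, float]], float]) -> None:
        self._execute = execute
        self._cache: Dict[Tuple[str, ConfigKey], float] = {}
        self.executions = 0  # number of times the workload was actually run

    def latency(self, query: str, config: Dict[str, float]) -> float:
        # Sort the knobs so that dict ordering does not affect the cache key.
        key = (query, tuple(sorted(config.items())))
        if key not in self._cache:
            self.executions += 1
            self._cache[key] = self._execute(query, config)
        return self._cache[key]


def fake_execute(query: str, config: Dict[str, float]) -> float:
    """Stand-in for a DBMS run: more memory makes the fake query faster."""
    return 10.0 * len(query) / config["work_mem_mb"]


cache = TelemetryCache(fake_execute)
first = cache.latency("SELECT 1", {"work_mem_mb": 16.0})
second = cache.latency("SELECT 1", {"work_mem_mb": 16.0})  # served from the cache
third = cache.latency("SELECT 1", {"work_mem_mb": 64.0})   # new config, runs again
```

Exact-match caching is safe only when repeated runs are expected to behave identically; a real system would also have to account for effects such as data changes and warm caches, which this sketch ignores.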

The techniques in this dissertation show how a training-data-centric approach to tool-based database tuning accelerates tool operation and adapts tool assumptions to new environments, with the database gym architecture providing a framework that simplifies tool development while enabling novel optimizations.
