Abstract
We present an efficient algorithm for model-free episodic reinforcement learning on large (potentially continuous) state-action spaces. Our algorithm is based on a novel Q-learning policy with adaptive, data-driven discretization. The central idea is to maintain a finer partition of the state-action space in regions that are frequently visited in historical trajectories and have higher payoff estimates. We demonstrate how our adaptive partitions take advantage of the shape of the optimal Q-function and of the joint state-action space, without sacrificing worst-case performance. In particular, we recover the regret guarantees of prior algorithms for continuous state-action spaces, which additionally require either an optimal discretization as input and/or access to a simulation oracle. Moreover, experiments demonstrate how our algorithm automatically adapts to the underlying structure of the problem, resulting in much better performance compared both to heuristics and to Q-learning with uniform discretization.
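To make the partitioning idea concrete, the sketch below shows a minimal Python version of Q-learning with adaptive discretization over a 1-D state and 1-D action space normalized to [0, 1]. All names (`Ball`, `AdaptiveQAgent`, `select`, `update`), the splitting threshold, and the bonus and learning-rate constants are illustrative assumptions rather than the paper's exact algorithm; the authors' implementation is in the GitHub repository linked below.

```python
import math


class Ball:
    """A region (hypercube) of the joint state-action space [0,1] x [0,1]."""

    def __init__(self, s_lo, s_hi, a_lo, a_hi, q_init):
        self.s_lo, self.s_hi = s_lo, s_hi
        self.a_lo, self.a_hi = a_lo, a_hi
        self.q = q_init          # optimistic Q-value estimate for this region
        self.visits = 0
        self.children = []       # populated once the ball is split

    @property
    def radius(self):
        return (self.s_hi - self.s_lo) / 2.0

    def contains_state(self, s):
        return self.s_lo <= s <= self.s_hi


class AdaptiveQAgent:
    """Sketch of Q-learning with adaptive, data-driven discretization.

    Frequently visited, high-payoff regions get split into finer balls;
    rarely visited regions stay coarse.
    """

    def __init__(self, horizon, q_init=None):
        self.H = horizon
        q_init = horizon if q_init is None else q_init
        # one partition (tree of balls) per step of the episode
        self.roots = [Ball(0.0, 1.0, 0.0, 1.0, q_init) for _ in range(horizon)]

    def _active_balls(self, ball, s):
        """Leaves of the partition tree whose state slice contains s."""
        if not ball.children:
            return [ball]
        out = []
        for child in ball.children:
            if child.contains_state(s):
                out.extend(self._active_balls(child, s))
        return out

    def select(self, h, s):
        """Pick the relevant ball with the largest (optimistic) Q estimate."""
        balls = self._active_balls(self.roots[h], s)
        best = max(balls, key=lambda b: b.q)
        # play the center of the chosen ball's action slice
        action = (best.a_lo + best.a_hi) / 2.0
        return best, action

    def update(self, ball, h, reward, next_state):
        """One-step Q-learning update on the selected ball, then maybe split it."""
        ball.visits += 1
        t = ball.visits
        lr = (self.H + 1) / (self.H + t)           # step size after t visits
        bonus = 2.0 * math.sqrt(self.H ** 3 / t)   # optimism bonus (illustrative)
        if h + 1 < self.H:
            next_balls = self._active_balls(self.roots[h + 1], next_state)
            v_next = max(b.q for b in next_balls)
        else:
            v_next = 0.0
        target = reward + v_next + bonus
        ball.q = (1 - lr) * ball.q + lr * target
        # split rule: refine once visits exceed ~1/radius^2, so heavily
        # visited regions get a finer partition
        if not ball.children and t >= (1.0 / ball.radius) ** 2:
            self._split(ball)

    def _split(self, ball):
        """Replace a ball by four children, each inheriting the parent's estimate."""
        s_mid = (ball.s_lo + ball.s_hi) / 2.0
        a_mid = (ball.a_lo + ball.a_hi) / 2.0
        for s_lo, s_hi in ((ball.s_lo, s_mid), (s_mid, ball.s_hi)):
            for a_lo, a_hi in ((ball.a_lo, a_mid), (a_mid, ball.a_hi)):
                ball.children.append(Ball(s_lo, s_hi, a_lo, a_hi, ball.q))
```

The split rule ties the number of visits a region can absorb to its radius, so only regions that are visited often, and that look promising through their optimistic estimates, get refined; rarely visited regions remain a single coarse ball, which is what keeps the partition adaptive rather than uniform.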
Original language | English (US) |
---|---|
Pages (from-to) | 17-18 |
Number of pages | 2 |
Journal | Performance Evaluation Review |
Volume | 48 |
Issue number | 1 |
State | Published - Jul 8 2020 |
Funding
The full paper is available at https://dl.acm.org/doi/10.1145/3366703. The code for the experiments is available at https://github.com/seanrsinclair/AdaptiveQLearning. We gratefully acknowledge funding from the NSF under grants ECCS-1847393 and DMS-1839346, and the ARL under grant W911NF-17-1-0094.
Keywords
- adaptive discretization
- metric spaces
- model-free
- Q-learning
- reinforcement learning
ASJC Scopus subject areas
- Software
- Hardware and Architecture
- Computer Networks and Communications