(More) efficient reinforcement learning via posterior sampling

Ian Osband, Benjamin Van Roy, Daniel Russo

Research output: Contribution to journal › Conference article

66 Citations (Scopus)

Abstract

Most provably-efficient reinforcement learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration: posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an Õ(τ S√AT) bound on expected regret, where T is time, τ is the episode length and S and A are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds.
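As a concrete illustration of the procedure described in the abstract, the following is a minimal sketch of PSRL for a tabular, episodic MDP. The conjugate priors chosen here (a Dirichlet prior over each transition distribution and a Beta prior over Bernoulli mean rewards), the env.reset()/env.step() interface, and all parameter names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def psrl(env, S, A, tau, num_episodes, seed=0):
    """Sketch of posterior sampling for reinforcement learning (PSRL)."""
    rng = np.random.default_rng(seed)
    # Posterior parameters: Dirichlet counts for transitions, Beta counts for rewards.
    trans_counts = np.ones((S, A, S))   # Dirichlet(1, ..., 1) prior on P(. | s, a)
    rew_alpha = np.ones((S, A))         # Beta(1, 1) prior on the mean reward of (s, a)
    rew_beta = np.ones((S, A))

    for _ in range(num_episodes):
        # 1. Draw a single MDP from the current posterior.
        P = np.array([[rng.dirichlet(trans_counts[s, a]) for a in range(A)]
                      for s in range(S)])            # shape (S, A, S)
        R = rng.beta(rew_alpha, rew_beta)            # shape (S, A)

        # 2. Compute the optimal non-stationary policy for the sampled MDP by
        #    finite-horizon dynamic programming over the tau steps of the episode.
        V = np.zeros(S)
        policy = np.zeros((tau, S), dtype=int)
        for h in reversed(range(tau)):
            Q = R + P @ V                            # Q[s, a] = R[s, a] + E[V(s')]
            policy[h] = Q.argmax(axis=1)
            V = Q.max(axis=1)

        # 3. Follow the sampled-optimal policy for one episode and update the
        #    posterior counts with the observed transitions and rewards.
        s = env.reset()
        for h in range(tau):
            a = policy[h, s]
            s_next, r = env.step(a)                  # assumed to return (next state, reward)
            trans_counts[s, a, s_next] += 1
            rew_alpha[s, a] += r                     # assumes Bernoulli rewards, r in {0, 1}
            rew_beta[s, a] += 1 - r
            s = s_next
    return policy
```

Note that exploration in this sketch comes entirely from the randomness of the single posterior sample drawn at the start of each episode; no optimism bonus is added to the value estimates, which is the contrast with optimism-based algorithms drawn in the abstract.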

Original language: English (US)
Journal: Advances in Neural Information Processing Systems
State: Published - Jan 1 2013
Event: 27th Annual Conference on Neural Information Processing Systems, NIPS 2013 - Lake Tahoe, NV, United States
Duration: Dec 5 2013 - Dec 10 2013

Fingerprint

  • Reinforcement learning
  • Sampling
  • Learning algorithms

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing

Cite this

@article{4d46c2d7c2e54ae39375e563c619eba9,
title = "(More) efficient reinforcement learning via posterior sampling",
abstract = "Most provably-efficient reinforcement learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration: posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an {\~O}(τ S√AT) bound on expected regret, where T is time, τ is the episode length and S and A are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds.",
author = "Ian Osband and {Van Roy}, Benjamin and Daniel Russo",
year = "2013",
month = "1",
day = "1",
language = "English (US)",
journal = "Advances in Neural Information Processing Systems",
issn = "1049-5258",

}

(More) efficient reinforcement learning via posterior sampling. / Osband, Ian; Van Roy, Benjamin; Russo, Daniel.

In: Advances in Neural Information Processing Systems, 01.01.2013.

Research output: Contribution to journal › Conference article

TY - JOUR

T1 - (More) efficient reinforcement learning via posterior sampling

AU - Osband, Ian

AU - Van Roy, Benjamin

AU - Russo, Daniel

PY - 2013/1/1

Y1 - 2013/1/1

N2 - Most provably-efficient reinforcement learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration: posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an Õ(τ S√AT) bound on expected regret, where T is time, τ is the episode length and S and A are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds.

AB - Most provably-efficient reinforcement learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration: posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an Õ(τ S√AT) bound on expected regret, where T is time, τ is the episode length and S and A are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds.

UR - http://www.scopus.com/inward/record.url?scp=84899019264&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84899019264&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:84899019264

JO - Advances in Neural Information Processing Systems

JF - Advances in Neural Information Processing Systems

SN - 1049-5258

ER -