Eluder dimension and the sample complexity of optimistic exploration

Daniel Russo, Benjamin Van Roy

Research output: Contribution to journal › Conference article

13 Citations (Scopus)

Abstract

This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, can be analyzed in the same manner as optimistic approaches. In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models.
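The abstract's two key ingredients can be made concrete. First, the eluder dimension itself. The following is a paraphrase in simplified notation, not a quotation; consult the published paper for the exact statements. Fix a class F of candidate reward functions over the action set. An action a is ε-dependent on previously selected actions a_1, ..., a_n with respect to F if any two functions in F that are close on the history are also close at a:

    % Paraphrase of the paper's definitions, in simplified notation.
    % An action a is ε-dependent on a_1, ..., a_n with respect to F when:
    \sqrt{\sum_{i=1}^{n} \big( f(a_i) - \tilde{f}(a_i) \big)^2} \le \epsilon
      \;\Longrightarrow\; \big| f(a) - \tilde{f}(a) \big| \le \epsilon
      \qquad \text{for all } f, \tilde{f} \in \mathcal{F},
    % and ε-independent otherwise. The eluder dimension dim_E(F, ε) is the
    % length of the longest sequence of actions in which, for some ε' ≥ ε,
    % every element is ε'-independent of its predecessors.

Second, the posterior sampling (Thompson sampling) algorithm class the bound covers. Below is a minimal illustrative sketch for a linear-Gaussian bandit; every name, prior, and constant is an assumption chosen for the example, not taken from the paper:

    # Illustrative sketch only: Thompson sampling for a linear-Gaussian bandit,
    # one of the algorithm classes covered by the paper's regret bound.
    # Priors, noise model, and constants are assumptions for this example.
    import numpy as np

    rng = np.random.default_rng(0)
    d, n_actions, horizon, noise_sd = 5, 50, 2000, 0.5
    actions = rng.normal(size=(n_actions, d))   # feature vector per arm
    theta_true = rng.normal(size=d)             # unknown reward parameter

    # Conjugate Gaussian prior theta ~ N(0, I), known noise variance.
    precision = np.eye(d)    # posterior precision matrix
    b = np.zeros(d)          # precision-weighted mean accumulator

    for t in range(horizon):
        # Sample a model from the current posterior N(mean, precision^{-1}).
        cov = np.linalg.inv(precision)
        mean = cov @ b
        theta_sample = rng.multivariate_normal(mean, cov)
        # Act greedily with respect to the sampled model (posterior sampling).
        a = actions[np.argmax(actions @ theta_sample)]
        reward = float(a @ theta_true) + noise_sd * rng.standard_normal()
        # Bayesian linear-regression posterior update.
        precision += np.outer(a, a) / noise_sd**2
        b += a * reward / noise_sd**2

The same loop with theta_sample replaced by an upper-confidence estimate over a confidence ellipsoid gives the UCB variant; the paper's contribution is a single regret bound, parameterized by the eluder dimension of F, that applies to both.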

Original language: English (US)
Journal: Advances in Neural Information Processing Systems
State: Published - Jan 1 2013
Event: 27th Annual Conference on Neural Information Processing Systems, NIPS 2013 - Lake Tahoe, NV, United States
Duration: Dec 5 2013 - Dec 10 2013

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing

Cite this

@article{c76e90c19b144026bcd51ae8657932ea,
title = "Eluder dimension and the sample complexity of optimistic exploration",
author = "Daniel Russo and {Van Roy}, Benjamin",
year = "2013",
month = "1",
day = "1",
language = "English (US)",
journal = "Advances in Neural Information Processing Systems",
issn = "1049-5258",
}
