Abstract
We provide an information-theoretic analysis of Thompson sampling that applies across a broad range of online optimization problems in which a decision-maker must learn from partial feedback. This analysis inherits the simplicity and elegance of information theory and leads to regret bounds that scale with the entropy of the optimal-action distribution. This strengthens preexisting results and yields new insight into how information improves performance.
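Since the paper centers on Thompson sampling, a minimal sketch of the algorithm may help orient the reader. This is an illustrative Bernoulli-bandit instance with Beta(1, 1) priors, not the paper's general partial-feedback setting; the function name and parameters are chosen here for exposition.

```python
import random

def thompson_sampling(true_means, horizon, seed=0):
    """Thompson sampling on a Bernoulli bandit with Beta(1, 1) priors.

    Each round: sample a mean from every arm's posterior, pull the arm
    whose sample is largest, then update that arm's Beta posterior with
    the observed 0/1 reward. Returns the per-arm pull counts.
    """
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1.0] * k  # 1 + observed successes per arm
    beta = [1.0] * k   # 1 + observed failures per arm
    pulls = [0] * k
    for _ in range(horizon):
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

Randomizing actions according to the posterior probability that each is optimal is what links the algorithm's regret to the entropy of the optimal-action distribution in the paper's analysis: when that distribution is concentrated (low entropy), little exploration is needed.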
| Original language | English (US) |
|---|---|
| Journal | Journal of Machine Learning Research |
| Volume | 17 |
| State | Published - Apr 1 2016 |
Keywords
- Information theory
- Multi-armed bandit
- Online optimization
- Regret bounds
- Thompson sampling
ASJC Scopus subject areas
- Software
- Control and Systems Engineering
- Statistics and Probability
- Artificial Intelligence