Abstract
While single-agent policy optimization in a fixed environment has attracted a lot of research attention recently in the reinforcement learning community, much less is known theoretically when there are multiple agents playing in a potentially competitive environment. We take steps forward by proposing and analyzing new fictitious play policy optimization algorithms for two-player zero-sum Markov games with structured but unknown transitions. We consider two classes of transition structures: factored independent transition and single-controller transition. For both scenarios, we prove tight Oe(√T) regret bounds after T steps in a two-agent competitive game scenario. The regret of each player is measured against a potentially adversarial opponent who can choose a single best policy in hindsight after observing the full policy sequence. Our algorithms feature a combination of Upper Confidence Bound (UCB)-type optimism and fictitious play under the scope of simultaneous policy optimization in a non-stationary environment. When both players adopt the proposed algorithms, their overall optimality gap is Oe(√T).
Original language | English (US) |
---|---|
Title of host publication | Proceedings of the 38th International Conference on Machine Learning, ICML 2021 |
Publisher | ML Research Press |
Pages | 8715-8725 |
Number of pages | 11 |
ISBN (Electronic) | 9781713845065 |
State | Published - 2021 |
Event | 38th International Conference on Machine Learning, ICML 2021 - Virtual, Online Duration: Jul 18 2021 → Jul 24 2021 |
Publication series
Name | Proceedings of Machine Learning Research |
---|---|
Volume | 139 |
ISSN (Electronic) | 2640-3498 |
Conference
Conference | 38th International Conference on Machine Learning, ICML 2021 |
---|---|
City | Virtual, Online |
Period | 7/18/21 → 7/24/21 |
Funding
Zhaoran Wang acknowledges National Science Foundation (Awards 2048075, 2008827, 2015568, 1934931), Simons Institute (Theory of Reinforcement Learning), Amazon, J.P. Morgan, and Two Sigma for their supports. Zhuoran Yang acknowledges Simons Institute (Theory of Reinforcement Learning).
ASJC Scopus subject areas
- Artificial Intelligence
- Software
- Control and Systems Engineering
- Statistics and Probability