TY - JOUR
T1 - Variance reduced policy evaluation with smooth function approximation
AU - Wai, Hoi To
AU - Hong, Mingyi
AU - Yang, Zhuoran
AU - Wang, Zhaoran
AU - Tang, Kexin
N1 - Funding Information:
H.-T. Wai is supported by the CUHK Direct Grant #4055113. M. Hong is supported in part by NSF under Grant CCF-1651825, CMMI-172775, CIF-1910385 and by AFOSR under grant 19RT0424.
Publisher Copyright:
© 2019 Neural Information Processing Systems Foundation. All rights reserved.
PY - 2019
Y1 - 2019
N2 - Policy evaluation with smooth and nonlinear function approximation has shown great potential for reinforcement learning. Compared to linear function approximation, it allows for using a richer class of approximation functions such as neural networks. Traditional algorithms are based on two-timescale stochastic approximation, whose convergence rate is often slow. This paper focuses on an offline setting where a trajectory of m state-action pairs is observed. We formulate the policy evaluation problem as a non-convex, primal-dual, finite-sum optimization problem, whose primal sub-problem is non-convex and dual sub-problem is strongly concave. We suggest a single-timescale primal-dual gradient algorithm with variance reduction, and show that it converges to an ε-stationary point using O(m/ε) calls (in expectation) to a gradient oracle.
AB - Policy evaluation with smooth and nonlinear function approximation has shown great potential for reinforcement learning. Compared to linear function approximation, it allows for using a richer class of approximation functions such as neural networks. Traditional algorithms are based on two-timescale stochastic approximation, whose convergence rate is often slow. This paper focuses on an offline setting where a trajectory of m state-action pairs is observed. We formulate the policy evaluation problem as a non-convex, primal-dual, finite-sum optimization problem, whose primal sub-problem is non-convex and dual sub-problem is strongly concave. We suggest a single-timescale primal-dual gradient algorithm with variance reduction, and show that it converges to an ε-stationary point using O(m/ε) calls (in expectation) to a gradient oracle.
UR - http://www.scopus.com/inward/record.url?scp=85090170071&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85090170071&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85090170071
VL - 32
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
SN - 1049-5258
T2 - 33rd Annual Conference on Neural Information Processing Systems, NeurIPS 2019
Y2 - 8 December 2019 through 14 December 2019
ER -