Exponentially weighted imitation learning for batched historical data

Qing Wang, Jiechao Xiong, Lei Han, Peng Sun, Han Liu, Tong Zhang

Research output: Contribution to journal › Conference article

Abstract

We consider deep policy learning with only batched historical trajectories. The main challenge of this problem is that the learner no longer has a simulator or “environment oracle” as in most reinforcement learning settings. To solve this problem, we propose a monotonic advantage reweighted imitation learning strategy that is applicable to problems with complex nonlinear function approximation and works well with hybrid (discrete and continuous) action space. The method does not rely on the knowledge of the behavior policy, thus can be used to learn from data generated by an unknown policy. Under mild conditions, our algorithm, though surprisingly simple, has a policy improvement bound and outperforms most competing methods empirically. Thorough numerical results are also provided to demonstrate the efficacy of the proposed methodology.
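The abstract describes reweighting an imitation (behavior-cloning) objective by an exponential function of the advantage, so that logged actions with higher estimated advantage contribute more to the likelihood being maximized. A minimal sketch of such an objective is shown below; the function name, array shapes, and the temperature parameter `beta` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def exp_weighted_imitation_loss(log_probs, advantages, beta=1.0):
    """Sketch of an exponentially advantage-weighted imitation loss.

    log_probs:  log pi_theta(a_t | s_t) for the actions in the batch, shape (N,)
    advantages: advantage estimates A(s_t, a_t) from the logged data, shape (N,)
    beta:       temperature; beta = 0 reduces to plain behavior cloning.
    """
    # Weight each logged action by exp(beta * advantage): actions judged
    # better than average are imitated more strongly.
    weights = np.exp(beta * advantages)
    # Minimize the negative weighted log-likelihood of the logged actions.
    return -np.mean(weights * log_probs)
```

Note that the loss depends only on the logged state-action pairs and an advantage estimate, not on the behavior policy's probabilities, which matches the abstract's claim that the method works with data from an unknown policy.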

Original language: English (US)
Pages (from-to): 6288-6297
Number of pages: 10
Journal: Advances in Neural Information Processing Systems
Volume: 2018-December
State: Published - Jan 1 2018
Event: 32nd Conference on Neural Information Processing Systems, NeurIPS 2018 - Montreal, Canada
Duration: Dec 2 2018 - Dec 8 2018

Fingerprint

  • Reinforcement learning
  • Simulators
  • Trajectories

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing

Cite this

Wang, Q., Xiong, J., Han, L., Sun, P., Liu, H., & Zhang, T. (2018). Exponentially weighted imitation learning for batched historical data. Advances in Neural Information Processing Systems, 2018-December, 6288-6297.
Wang, Qing; Xiong, Jiechao; Han, Lei; Sun, Peng; Liu, Han; Zhang, Tong. / Exponentially weighted imitation learning for batched historical data. In: Advances in Neural Information Processing Systems. 2018; Vol. 2018-December. pp. 6288-6297.
@article{b601b1d49df44df3bef84f05dc9347ec,
title = "Exponentially weighted imitation learning for batched historical data",
abstract = "We consider deep policy learning with only batched historical trajectories. The main challenge of this problem is that the learner no longer has a simulator or “environment oracle” as in most reinforcement learning settings. To solve this problem, we propose a monotonic advantage reweighted imitation learning strategy that is applicable to problems with complex nonlinear function approximation and works well with hybrid (discrete and continuous) action space. The method does not rely on the knowledge of the behavior policy, thus can be used to learn from data generated by an unknown policy. Under mild conditions, our algorithm, though surprisingly simple, has a policy improvement bound and outperforms most competing methods empirically. Thorough numerical results are also provided to demonstrate the efficacy of the proposed methodology.",
author = "Qing Wang and Jiechao Xiong and Lei Han and Peng Sun and Han Liu and Tong Zhang",
year = "2018",
month = "1",
day = "1",
language = "English (US)",
volume = "2018-December",
pages = "6288--6297",
journal = "Advances in Neural Information Processing Systems",
issn = "1049-5258",
}

Wang, Q, Xiong, J, Han, L, Sun, P, Liu, H & Zhang, T 2018, 'Exponentially weighted imitation learning for batched historical data', Advances in Neural Information Processing Systems, vol. 2018-December, pp. 6288-6297.

Exponentially weighted imitation learning for batched historical data. / Wang, Qing; Xiong, Jiechao; Han, Lei; Sun, Peng; Liu, Han; Zhang, Tong.

In: Advances in Neural Information Processing Systems, Vol. 2018-December, 01.01.2018, p. 6288-6297.

Research output: Contribution to journal › Conference article

TY - JOUR

T1 - Exponentially weighted imitation learning for batched historical data

AU - Wang, Qing

AU - Xiong, Jiechao

AU - Han, Lei

AU - Sun, Peng

AU - Liu, Han

AU - Zhang, Tong

PY - 2018/1/1

Y1 - 2018/1/1

N2 - We consider deep policy learning with only batched historical trajectories. The main challenge of this problem is that the learner no longer has a simulator or “environment oracle” as in most reinforcement learning settings. To solve this problem, we propose a monotonic advantage reweighted imitation learning strategy that is applicable to problems with complex nonlinear function approximation and works well with hybrid (discrete and continuous) action space. The method does not rely on the knowledge of the behavior policy, thus can be used to learn from data generated by an unknown policy. Under mild conditions, our algorithm, though surprisingly simple, has a policy improvement bound and outperforms most competing methods empirically. Thorough numerical results are also provided to demonstrate the efficacy of the proposed methodology.

AB - We consider deep policy learning with only batched historical trajectories. The main challenge of this problem is that the learner no longer has a simulator or “environment oracle” as in most reinforcement learning settings. To solve this problem, we propose a monotonic advantage reweighted imitation learning strategy that is applicable to problems with complex nonlinear function approximation and works well with hybrid (discrete and continuous) action space. The method does not rely on the knowledge of the behavior policy, thus can be used to learn from data generated by an unknown policy. Under mild conditions, our algorithm, though surprisingly simple, has a policy improvement bound and outperforms most competing methods empirically. Thorough numerical results are also provided to demonstrate the efficacy of the proposed methodology.

UR - http://www.scopus.com/inward/record.url?scp=85064820910&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85064820910&partnerID=8YFLogxK

M3 - Conference article

VL - 2018-December

SP - 6288

EP - 6297

JO - Advances in Neural Information Processing Systems

JF - Advances in Neural Information Processing Systems

SN - 1049-5258

ER -