Provably Efficient Offline Reinforcement Learning for Partially Observable Markov Decision Processes

Hongyi Guo, Qi Cai, Yufeng Zhang, Zhuoran Yang, Zhaoran Wang*

*Corresponding author for this work

Research output: Contribution to journal › Conference article › Peer-review



We study offline reinforcement learning (RL) for partially observable Markov decision processes (POMDPs) with possibly infinite state and observation spaces. Under the undercompleteness assumption, the transition kernel can be estimated by solving a series of confounded regression problems. To handle the confounding, we select a proper instrumental variable (IV) and solve the IV regression problem to construct confidence regions for the model parameters. We then obtain the final policy via pessimistic planning within the confidence regions. We prove that the proposed algorithm attains an ε-optimal policy using an offline dataset containing Õ(1/ε²) episodes, provided that the behavior policy has good coverage over the optimal trajectory. To the best of our knowledge, ours is the first provably sample-efficient offline algorithm for POMDPs beyond the tabular setting.
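As a rough illustration of the instrumental-variable idea the abstract invokes (not the paper's actual estimator, which operates on POMDP model parameters), the following NumPy sketch shows two-stage least squares on simulated data with a hypothetical scalar treatment, unobserved confounder, and instrument. Naive regression is biased by the confounder; projecting onto the instrument removes the bias.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical setup (not from the paper): treatment x is confounded by
# unobserved u; instrument z influences x but affects y only through x.
z = rng.normal(size=n)                                # instrument
u = rng.normal(size=n)                                # unobserved confounder
x = z + u + 0.1 * rng.normal(size=n)                  # confounded treatment
y = 2.0 * x + 3.0 * u + 0.1 * rng.normal(size=n)      # true effect of x is 2.0

# Naive least squares is biased because x is correlated with u.
beta_ols = (x @ y) / (x @ x)

# Two-stage least squares: (1) project x onto z, (2) regress y on the projection.
x_hat = z * ((z @ x) / (z @ z))                       # stage 1: fitted treatment
beta_iv = (x_hat @ y) / (x_hat @ x_hat)               # stage 2: consistent estimate
```

Here `beta_iv` recovers the true coefficient 2.0 (up to sampling noise), while `beta_ols` is pulled upward by the confounder; the paper's method plays an analogous trick at the level of confounded regression problems arising from the POMDP transition kernel.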

Original language: English (US)
Pages (from-to): 8016-8038
Number of pages: 23
Journal: Proceedings of Machine Learning Research
State: Published - 2022
Event: 39th International Conference on Machine Learning, ICML 2022 - Baltimore, United States
Duration: Jul 17 2022 - Jul 23 2022

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability


