Policy teaching through reward function learning

Haoqi Zhang*, David C. Parkes, Yiling Chen

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

43 Scopus citations

Abstract

Policy teaching considers a Markov Decision Process setting in which an interested party aims to influence an agent's decisions by providing limited incentives. In this paper, we consider the specific objective of inducing a pre-specified desired policy. We examine both the case in which the agent's reward function is known to the interested party and the case in which it is unknown, presenting a linear program for the former and formulating an active, indirect elicitation method for the latter. We provide conditions for logarithmic convergence, and present a polynomial-time algorithm that ensures logarithmic convergence with arbitrarily high probability. We also offer practical elicitation heuristics that can be formulated as linear programs, and demonstrate their effectiveness on a policy teaching problem in a simulated ad-network setting. We extend our methods to handle partial observations and partial target policies, and provide a game-theoretic interpretation of our methods for handling strategic agents.
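To make the setting concrete, the sketch below checks whether a given incentive induces a desired policy in a toy two-state MDP when the agent's reward function is known. It is an illustration only and not the paper's linear program: the MDP, the reward values, the discount factor, and all names (`AGENT_REWARD`, `with_incentive`, `induces_target`) are assumptions made for this example, and the check uses plain value iteration.

```python
# Toy policy-teaching check (hypothetical MDP, not taken from the paper).
# The agent acts optimally for reward + incentive; the interested party
# wants the agent to follow TARGET_POLICY.

GAMMA = 0.9                      # assumed discount factor
STAY, MOVE = 0, 1                # the two actions
STATES = (0, 1)
ACTIONS = (STAY, MOVE)

# NEXT_STATE[s][a]: deterministic transitions (STAY keeps the state, MOVE flips it).
NEXT_STATE = [[0, 1], [1, 0]]

# AGENT_REWARD[s][a]: the agent likes staying in state 0 (assumed values).
AGENT_REWARD = [[1.0, 0.0], [0.0, 0.0]]

# The interested party wants the agent to move to state 1 and stay there.
TARGET_POLICY = [MOVE, STAY]


def optimal_policy(reward, sweeps=1000):
    """Return the greedy policy after `sweeps` rounds of value iteration."""
    values = [0.0 for _ in STATES]
    for _ in range(sweeps):
        values = [
            max(reward[s][a] + GAMMA * values[NEXT_STATE[s][a]] for a in ACTIONS)
            for s in STATES
        ]
    return [
        max(ACTIONS, key=lambda a: reward[s][a] + GAMMA * values[NEXT_STATE[s][a]])
        for s in STATES
    ]


def with_incentive(bonus):
    """Agent reward plus a bonus paid for choosing STAY in state 1."""
    modified = [row[:] for row in AGENT_REWARD]
    modified[1][STAY] += bonus
    return modified


def induces_target(bonus):
    """True if the agent's optimal policy under the bonus equals TARGET_POLICY."""
    return optimal_policy(with_incentive(bonus)) == TARGET_POLICY
```

In this toy instance the agent moves out of state 0 only when the bonus `d` satisfies `0.9 * 10 * d > 1 + 0.9 * 0.9 * 10 * d`, that is, `d > 1/0.9`; calling `induces_target` with a bonus on either side of that threshold shows the switch.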

Original language: English (US)
Title of host publication: EC'09 - Proceedings of the 2009 ACM Conference on Electronic Commerce
Pages: 295-304
Number of pages: 10
State: Published - 2009
Event: 2009 ACM Conference on Electronic Commerce, EC'09 - Stanford, CA, United States
Duration: Jul 6 2009 → Jul 10 2009

Publication series

Name: Proceedings of the ACM Conference on Electronic Commerce

Other

Other: 2009 ACM Conference on Electronic Commerce, EC'09
Country/Territory: United States
City: Stanford, CA
Period: 7/6/09 → 7/10/09

Keywords

  • Active indirect elicitation
  • Environment design
  • Policy teaching
  • Preference elicitation
  • Preference learning

ASJC Scopus subject areas

  • Software
  • Computer Science Applications
  • Computer Networks and Communications
