Authors
Qi Cai, Zhuoran Yang, Chi Jin, Zhaoran Wang
Publication date
2020/11/21
Conference
International Conference on Machine Learning
Pages
1283-1294
Publisher
PMLR
Description
While policy-based reinforcement learning (RL) achieves tremendous success in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge such a gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction. This paper proves that, in the problem of episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves a √(d²H³T) regret up to logarithmic factors. Here d is the feature dimension, H is the episode horizon, and T is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
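The sketch below illustrates, under simplifying assumptions, the kind of optimistic policy-optimization step the abstract describes for a linear MDP: fit a Q-estimate by ridge regression on features phi(s, a), add an elliptical exploration bonus so the estimate is optimistic, then take a mirror-descent (exponentiated-gradient) policy step. The feature map phi, bonus scale beta, step size alpha, and synthetic data are illustrative placeholders, not the paper's exact algorithm or constants.

```python
import numpy as np

# Toy sketch of an optimistic policy-optimization update for a linear MDP.
# Assumed setup: tabular states/actions with a feature map phi(s, a) in R^d.
d, n_states, n_actions = 4, 5, 3
rng = np.random.default_rng(0)
phi = rng.normal(size=(n_states, n_actions, d))  # feature map phi(s, a)

def fit_optimistic_q(transitions, beta=1.0, lam=1.0):
    """Ridge-regress targets on features, then add a UCB-style elliptical bonus."""
    feats = np.array([phi[s, a] for s, a, _ in transitions])   # (n, d)
    targets = np.array([y for _, _, y in transitions])         # (n,)
    Lambda = lam * np.eye(d) + feats.T @ feats                  # regularized covariance
    w = np.linalg.solve(Lambda, feats.T @ targets)              # regression weights
    Lambda_inv = np.linalg.inv(Lambda)
    q = phi @ w                                                  # (S, A) point estimate
    # Bonus beta * sqrt(phi(s,a)^T Lambda^{-1} phi(s,a)) encourages exploration.
    bonus = beta * np.sqrt(np.einsum("sad,de,sae->sa", phi, Lambda_inv, phi))
    return q + bonus                                             # optimistic Q

def mirror_descent_step(policy, q_optimistic, alpha=0.1):
    """Exponentiated-gradient (softmax) policy improvement step."""
    logits = np.log(policy) + alpha * q_optimistic
    logits -= logits.max(axis=1, keepdims=True)                  # numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum(axis=1, keepdims=True)

# Usage with synthetic (state, action, regression target) triples.
data = [(rng.integers(n_states), rng.integers(n_actions), rng.normal())
        for _ in range(50)]
policy = np.full((n_states, n_actions), 1.0 / n_actions)        # uniform start
policy = mirror_descent_step(policy, fit_optimistic_q(data))
print(policy.round(3))
```

The optimism enters only through the bonus term added to the regression estimate; the policy update itself is an ordinary KL-regularized (PPO-style) step, which is the structure the abstract refers to as following an "optimistic version" of the policy gradient direction.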
Total citations
Scholar articles
Q Cai, Z Yang, C Jin, Z Wang - International Conference on Machine Learning, 2020