Regret minimization for reinforcement learning by evaluating the optimal bias function

D Zhou, Q Gu, C Szepesvari - Conference on Learning …, 2021 - proceedings.mlr.press

We study reinforcement learning (RL) with linear function approximation where the
underlying transition probability kernel of the Markov decision process (MDP) is a linear …

被引用次数：244 相关文章所有 7 个版本

[PDF] mlr.press

Nearly minimax optimal reinforcement learning for linear markov decision processes

J He, H Zhao, D Zhou, Q Gu - International Conference on …, 2023 - proceedings.mlr.press

We study reinforcement learning (RL) with linear function approximation. For episodic time-
inhomogeneous linear Markov decision processes (linear MDPs) whose transition …

被引用次数：59 相关文章所有 7 个版本

[PDF] mlr.press

Learning near optimal policies with low inherent bellman error

A Zanette, A Lazaric, M Kochenderfer… - International …, 2020 - proceedings.mlr.press

We study the exploration problem with approximate linear action-value functions in episodic
reinforcement learning under the notion of low inherent Bellman error, a condition normally …

被引用次数：256 相关文章所有 5 个版本

[PDF] neurips.cc

Almost optimal model-free reinforcement learningvia reference-advantage decomposition

Z Zhang, Y Zhou, X Ji - Advances in Neural Information …, 2020 - proceedings.neurips.cc

We study the reinforcement learning problem in the setting of finite-horizon1episodic Markov
Decision Processes (MDPs) with S states, A actions, and episode length H. We propose a …

被引用次数：178 相关文章所有 8 个版本

[PDF] mlr.press

Is reinforcement learning more difficult than bandits? a near-optimal algorithm escaping the curse of horizon

Z Zhang, X Ji, S Du - Conference on Learning Theory, 2021 - proceedings.mlr.press

Episodic reinforcement learning and contextual bandits are two widely studied sequential
decision-making problems. Episodic reinforcement learning generalizes contextual bandits …

被引用次数：125 相关文章所有 4 个版本

[PDF] neurips.cc

Optimistic posterior sampling for reinforcement learning: worst-case regret bounds

S Agrawal, R Jia - Advances in Neural Information …, 2017 - proceedings.neurips.cc

We present an algorithm based on posterior sampling (aka Thompson sampling) that
achieves near-optimal worst-case regret bounds when the underlying Markov Decision …

被引用次数：254 相关文章所有 13 个版本

[PDF] mlr.press

VOL: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation

A Agarwal, Y Jin, T Zhang - The Thirty Sixth Annual …, 2023 - proceedings.mlr.press

We study time-inhomogeneous episodic reinforcement learning (RL) under general function
approximation and sparse rewards. We design a new algorithm, Variance-weighted …

被引用次数：47 相关文章所有 5 个版本

[PDF] mlr.press

Learning adversarial markov decision processes with bandit feedback and unknown transition

C Jin, T Jin, H Luo, S Sra, T Yu - International Conference on …, 2020 - proceedings.mlr.press

We consider the task of learning in episodic finite-horizon Markov decision processes with
an unknown transition function, bandit feedback, and adversarial losses. We propose an …

被引用次数：110 相关文章所有 8 个版本

[PDF] arxiv.org

Understanding domain randomization for sim-to-real transfer

X Chen, J Hu, C Jin, L Li, L Wang - arXiv preprint arXiv:2110.03239, 2021 - arxiv.org

Reinforcement learning encounters many challenges when applied directly in the real world.
Sim-to-real transfer is widely used to transfer the knowledge learned from simulation to the …

被引用次数：90 相关文章所有 4 个版本

[PDF] mlr.press

Dueling rl: Reinforcement learning with trajectory preferences

A Saha, A Pacchiano, J Lee - International Conference on …, 2023 - proceedings.mlr.press

We consider the problem of preference-based reinforcement learning (PbRL), where, unlike
traditional reinforcement learning (RL), an agent receives feedback only in terms of 1 bit …

被引用次数：34 相关文章

高级搜索

QQ 群