Faster non-asymptotic convergence for double Q-learning

X Chen, L Zhao - Advances in Neural Information …, 2024 - proceedings.neurips.cc

Actor-critic methods have achieved significant success in many challenging applications.
However, their finite-time convergence is still poorly understood in the most practical single …

Cited by 8 · Related articles · All 8 versions

Reinventing Policy Iteration under Time Inconsistency

NS Lesmana, H Su, CS Pun - Transactions on Machine Learning …, 2022 - openreview.net

Policy iteration (PI) is a fundamental policy search algorithm in standard reinforcement
learning (RL) setting, which can be shown to converge to an optimal policy by policy …

Cited by 4 · Related articles · All 2 versions

Q-learning with heterogeneous update strategy

T Tan, H Xie, L Feng - Information Sciences, 2024 - Elsevier

A variety of algorithms has been proposed to mitigate the overestimation bias of Q-learning.
These algorithms reduce the estimation of maximum Q-value, i.e., homogeneous update. As …

Cited by 2 · Related articles · All 3 versions

Finite-Time Analysis of Simultaneous Double Q-learning

H Na, D Lee - arXiv preprint arXiv:2406.09946, 2024 - arxiv.org

$Q$-learning is one of the most fundamental reinforcement learning (RL) algorithms.
Despite its widespread success in various applications, it is prone to overestimation bias in …

Finite-Time Analysis of Temporal Difference Learning: Discrete-Time Linear System Perspective

D Lee, DW Kim - arXiv preprint arXiv:2204.10479, 2022 - arxiv.org

TD-learning is a fundamental algorithm in the field of reinforcement learning (RL), that is
employed to evaluate a given policy by estimating the corresponding value function for a …

Value Bonuses Using Ensemble Errors For Exploration in Reinforcement Learning

A Wahab - 2024 - era.library.ualberta.ca

(RL). The agent acts greedily with respect to an estimate of the value plus what can be seen
as a value bonus. The value bonus can be learned by estimating a value function on reward …
