Average-reward off-policy policy evaluation with function approximation

E Hassan, MY Shams, NA Hikal, S Elmougy - Multimedia Tools and …, 2023 - Springer

Optimization algorithms are used to improve model accuracy. The optimization process
undergoes multiple cycles until convergence. A variety of optimization strategies have been …

Cited by 144 Related articles All 12 versions

[PDF] neurips.cc

Finite-time analysis of Whittle index based Q-learning for restless multi-armed bandits with neural network function approximation

G Xiong, J Li - Advances in Neural Information Processing …, 2023 - proceedings.neurips.cc

Whittle index policy is a heuristic to the intractable restless multi-armed bandits (RMAB)
problem. Although it is provably asymptotically optimal, finding Whittle indices remains …

Cited by 13 Related articles All 8 versions

[PDF] neurips.cc

Markovian interference in experiments

V Farias, A Li, T Peng, A Zheng - Advances in Neural …, 2022 - proceedings.neurips.cc

We consider experiments in dynamical systems where interventions on some experimental
units impact other units through a limiting constraint (such as a limited supply of products) …

Cited by 40 Related articles All 8 versions

[PDF] mlr.press

Breaking the deadly triad with a target network

S Zhang, H Yao, S Whiteson - International Conference on …, 2021 - proceedings.mlr.press

The deadly triad refers to the instability of a reinforcement learning algorithm when it
employs off-policy learning, function approximation, and bootstrapping simultaneously. In …

Cited by 53 Related articles All 7 versions

[PDF] neurips.cc

Finite Sample Analysis of Average-Reward TD Learning and Q-Learning

S Zhang, Z Zhang, ST Maguluri - Advances in Neural …, 2021 - proceedings.neurips.cc

The focus of this paper is on sample complexity guarantees of average-reward
reinforcement learning algorithms, which are known to be more challenging to study than …

Cited by 27 Related articles All 7 versions

[PDF] mlr.press

Model-free robust average-reward reinforcement learning

Y Wang, A Velasquez, GK Atia… - International …, 2023 - proceedings.mlr.press

Robust Markov decision processes (MDPs) address the challenge of model
uncertainty by optimizing the worst-case performance over an uncertainty set of MDPs. In …

Cited by 8 Related articles All 8 versions

[PDF] neurips.cc

Optimal uniform OPE and model-based offline reinforcement learning in time-homogeneous, reward-free and task-agnostic settings

M Yin, YX Wang - Advances in neural information …, 2021 - proceedings.neurips.cc

This work studies the statistical limits of uniform convergence for offline policy evaluation
(OPE) problems with model-based methods (for episodic MDP) and provides a unified …

Cited by 27 Related articles All 8 versions

[PDF] arxiv.org

Stochastic first-order methods for average-reward Markov decision processes

T Li, F Wu, G Lan - Mathematics of Operations Research, 2024 - pubsonline.informs.org

We study average-reward Markov decision processes (AMDPs) and develop novel first-
order methods with strong theoretical guarantees for both policy optimization and policy …

Cited by 13 Related articles All 2 versions

[PDF] mlr.press

Off-policy average reward actor-critic with deterministic policy search

N Saxena, S Khastagir, S Kolathaya… - International …, 2023 - proceedings.mlr.press

The average reward criterion is relatively less studied as most existing works in the
Reinforcement Learning literature consider the discounted reward criterion. There are few …

Cited by 5 Related articles All 12 versions

[PDF] mlr.press

Modified retrace for off-policy temporal difference learning

X Chen, X Ma, Y Li, G Yang… - Uncertainty in Artificial …, 2023 - proceedings.mlr.press

Off-policy learning is a key to extend reinforcement learning as it allows to learn a target
policy from a different behavior policy that generates the data. However, it is well known as …

Cited by 3 Related articles All 5 versions
