Learning and planning in average-reward Markov decision processes

SJ Gershman, JA Assad, SR Datta, SW Linderman… - Nature …, 2024 - nature.com

The most influential account of phasic dopamine holds that it reports reward prediction
errors (RPEs). The RPE-based interpretation of dopamine signaling is, in its original form …

Cited by 15 · Related articles · All 4 versions

[HTML] nih.gov

[HTML] Batch policy learning in average reward Markov decision processes

P Liao, Z Qi, R Wan, P Klasnja, SA Murphy - Annals of statistics, 2022 - ncbi.nlm.nih.gov

We consider the batch (off-line) policy learning problem in the infinite horizon Markov
Decision Process. Motivated by mobile health applications, we focus on learning a policy …

Cited by 92 · Related articles · All 9 versions

[PDF] neurips.cc

Finite-time analysis of whittle index based Q-learning for restless multi-armed bandits with neural network function approximation

G Xiong, J Li - Advances in Neural Information Processing …, 2023 - proceedings.neurips.cc

Whittle index policy is a heuristic to the intractable restless multi-armed bandits (RMAB)
problem. Although it is provably asymptotically optimal, finding Whittle indices remains …

Cited by 13 · Related articles · All 8 versions

[PDF] neurips.cc

Markovian interference in experiments

V Farias, A Li, T Peng, A Zheng - Advances in Neural …, 2022 - proceedings.neurips.cc

We consider experiments in dynamical systems where interventions on some experimental
units impact other units through a limiting constraint (such as a limited supply of products) …

Cited by 40 · Related articles · All 8 versions

[PDF] mlr.press

Breaking the deadly triad with a target network

S Zhang, H Yao, S Whiteson - International Conference on …, 2021 - proceedings.mlr.press

The deadly triad refers to the instability of a reinforcement learning algorithm when it
employs off-policy learning, function approximation, and bootstrapping simultaneously. In …

Cited by 53 · Related articles · All 7 versions

[PDF] mlr.press

On-policy deep reinforcement learning for the average-reward criterion

Y Zhang, KW Ross - International Conference on Machine …, 2021 - proceedings.mlr.press

We develop theory and algorithms for average-reward on-policy Reinforcement Learning
(RL). We first consider bounding the difference of the long-term average reward for two …

Cited by 49 · Related articles · All 4 versions

[PDF] mlr.press

Average-reward off-policy policy evaluation with function approximation

S Zhang, Y Wan, RS Sutton… - … conference on machine …, 2021 - proceedings.mlr.press

We consider off-policy policy evaluation with function approximation (FA) in average-reward
MDPs, where the goal is to estimate both the reward rate and the differential value function …

Cited by 39 · Related articles · All 8 versions

[PDF] neurips.cc

Finite Sample Analysis of Average-Reward TD Learning and Q-Learning

S Zhang, Z Zhang, ST Maguluri - Advances in Neural …, 2021 - proceedings.neurips.cc

The focus of this paper is on sample complexity guarantees of average-reward
reinforcement learning algorithms, which are known to be more challenging to study than …

Cited by 27 · Related articles · All 7 versions

[PDF] neurips.cc

Influencing long-term behavior in multiagent reinforcement learning

DK Kim, M Riemer, M Liu, J Foerster… - Advances in …, 2022 - proceedings.neurips.cc

The main challenge of multiagent reinforcement learning is the difficulty of learning useful
policies in the presence of other simultaneously learning agents whose changing behaviors …

Cited by 22 · Related articles · All 8 versions

[PDF] arxiv.org

Single-trajectory distributionally robust reinforcement learning

Z Liang, X Ma, J Blanchet, J Zhang, Z Zhou - arXiv preprint arXiv …, 2023 - arxiv.org

As a framework for sequential decision-making, Reinforcement Learning (RL) has been
regarded as an essential component leading to Artificial General Intelligence (AGI) …

Cited by 13 · Related articles · All 5 versions
