Explaining dopamine through prediction errors and beyond

SJ Gershman, JA Assad, SR Datta, SW Linderman… - Nature …, 2024 - nature.com
The most influential account of phasic dopamine holds that it reports reward prediction
errors (RPEs). The RPE-based interpretation of dopamine signaling is, in its original form …

Batch policy learning in average reward Markov decision processes

P Liao, Z Qi, R Wan, P Klasnja, SA Murphy - Annals of statistics, 2022 - ncbi.nlm.nih.gov
We consider the batch (off-line) policy learning problem in the infinite horizon Markov
Decision Process. Motivated by mobile health applications, we focus on learning a policy …

Finite-time analysis of Whittle index based Q-learning for restless multi-armed bandits with neural network function approximation

G Xiong, J Li - Advances in Neural Information Processing …, 2023 - proceedings.neurips.cc
Whittle index policy is a heuristic to the intractable restless multi-armed bandits (RMAB)
problem. Although it is provably asymptotically optimal, finding Whittle indices remains …

Markovian interference in experiments

V Farias, A Li, T Peng, A Zheng - Advances in Neural …, 2022 - proceedings.neurips.cc
We consider experiments in dynamical systems where interventions on some experimental
units impact other units through a limiting constraint (such as a limited supply of products) …

Breaking the deadly triad with a target network

S Zhang, H Yao, S Whiteson - International Conference on …, 2021 - proceedings.mlr.press
The deadly triad refers to the instability of a reinforcement learning algorithm when it
employs off-policy learning, function approximation, and bootstrapping simultaneously. In …

On-policy deep reinforcement learning for the average-reward criterion

Y Zhang, KW Ross - International Conference on Machine …, 2021 - proceedings.mlr.press
We develop theory and algorithms for average-reward on-policy Reinforcement Learning
(RL). We first consider bounding the difference of the long-term average reward for two …

Average-reward off-policy policy evaluation with function approximation

S Zhang, Y Wan, RS Sutton… - … conference on machine …, 2021 - proceedings.mlr.press
We consider off-policy policy evaluation with function approximation (FA) in average-reward
MDPs, where the goal is to estimate both the reward rate and the differential value function …

Finite Sample Analysis of Average-Reward TD Learning and Q-Learning

S Zhang, Z Zhang, ST Maguluri - Advances in Neural …, 2021 - proceedings.neurips.cc
The focus of this paper is on sample complexity guarantees of average-reward
reinforcement learning algorithms, which are known to be more challenging to study than …

Influencing long-term behavior in multiagent reinforcement learning

DK Kim, M Riemer, M Liu, J Foerster… - Advances in …, 2022 - proceedings.neurips.cc
The main challenge of multiagent reinforcement learning is the difficulty of learning useful
policies in the presence of other simultaneously learning agents whose changing behaviors …

Single-trajectory distributionally robust reinforcement learning

Z Liang, X Ma, J Blanchet, J Zhang, Z Zhou - arXiv preprint arXiv …, 2023 - arxiv.org
As a framework for sequential decision-making, Reinforcement Learning (RL) has been
regarded as an essential component leading to Artificial General Intelligence (AGI) …