Rethinking the implementation tricks and monotonicity constraint in cooperative multi-agent reinforcement learning

J Hu, S Jiang, SA Harding, H Wu, S Liao - arXiv preprint arXiv:2102.03479, 2021 - arxiv.org
Many complex multi-agent systems, such as robot swarm control and autonomous vehicle
coordination, can be modeled as Multi-Agent Reinforcement Learning (MARL) tasks. QMIX, a …
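
QMIX's monotonicity constraint, ∂Q_tot/∂Q_a ≥ 0 for every agent a, is typically enforced by generating non-negative mixing weights from state-conditioned hypernetworks. A minimal PyTorch sketch of such a monotonic mixer (layer sizes and names are illustrative, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """Mixes per-agent Q-values into Q_tot with non-negative weights,
    so dQ_tot/dQ_a >= 0 (the QMIX monotonicity constraint)."""
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        # Hypernetworks: the global state generates the mixing weights.
        self.w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.b1 = nn.Linear(state_dim, embed_dim)
        self.w2 = nn.Linear(state_dim, embed_dim)
        self.b2 = nn.Linear(state_dim, 1)
        self.n_agents, self.embed_dim = n_agents, embed_dim

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        w1 = torch.abs(self.w1(state)).view(-1, self.n_agents, self.embed_dim)
        h = torch.relu(agent_qs.unsqueeze(1) @ w1 + self.b1(state).unsqueeze(1))
        w2 = torch.abs(self.w2(state)).view(-1, self.embed_dim, 1)
        return (h @ w2).squeeze(-1) + self.b2(state)  # Q_tot: (batch, 1)
```

Taking the absolute value of the hypernetwork outputs is what guarantees the non-negative mixing weights; the biases remain unconstrained.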

The phenomenon of policy churn

T Schaul, A Barreto, J Quan… - Advances in Neural …, 2022 - proceedings.neurips.cc
We identify and study the phenomenon of policy churn, that is, the rapid change of the
greedy policy in value-based reinforcement learning. Policy churn operates at a surprisingly …
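
Churn here refers to the greedy action changing at a state from one gradient update to the next. A hypothetical NumPy sketch of how one could measure it on a batch of held-out states (the function names and interface are mine, not the paper's):

```python
import numpy as np

def greedy_actions(q_values: np.ndarray) -> np.ndarray:
    """q_values: (n_states, n_actions) -> greedy action per state."""
    return q_values.argmax(axis=1)

def churn_fraction(q_before: np.ndarray, q_after: np.ndarray) -> float:
    """Fraction of states whose greedy action changed between two
    snapshots of the Q-function (before/after a gradient update)."""
    return float(np.mean(greedy_actions(q_before) != greedy_actions(q_after)))
```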

Averaging n-step Returns Reduces Variance in Reinforcement Learning

B Daley, M White, MC Machado - Forty-first International …, 2024 - openreview.net
Multistep returns, such as $n$-step returns and $\lambda$-returns, are commonly used to
improve the sample efficiency of reinforcement learning (RL) methods. The variance of the …
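
As a reminder of the objects being averaged: the $n$-step return bootstraps from the value estimate after $n$ rewards, and the $\lambda$-return is their geometric mixture with weights $(1-\lambda)\lambda^{n-1}$. A tabular NumPy sketch under these standard definitions (variable names are mine; the paper's compound returns generalize this weighting):

```python
import numpy as np

def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n) = sum_{k=0}^{n-1} gamma^k R_{t+k} + gamma^n V(S_{t+n})."""
    G = sum(gamma**k * rewards[t + k] for k in range(n))
    return G + gamma**n * values[t + n]

def lambda_return(rewards, values, t, gamma, lam):
    """G_t^lambda = (1 - lam) * sum_n lam^(n-1) G_t^(n), truncated at the
    end of the trajectory (remaining weight goes to the final return)."""
    T = len(rewards)  # rewards[t..T-1]; values must have length T + 1
    G, weight_sum = 0.0, 0.0
    for n in range(1, T - t):
        w = (1 - lam) * lam ** (n - 1)
        G += w * n_step_return(rewards, values, t, n, gamma)
        weight_sum += w
    # remaining probability mass lam^(T-t-1) goes to the full return
    G += (1 - weight_sum) * n_step_return(rewards, values, t, T - t, gamma)
    return G
```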

Opportunities and challenges from using animal videos in reinforcement learning for navigation

V Giammarino, J Queeney, LC Carstensen… - IFAC-PapersOnLine, 2023 - Elsevier
We investigate the use of animal videos (observations) to improve Reinforcement Learning
(RL) efficiency and performance in navigation tasks with sparse rewards. Motivated by …

Trajectory-aware eligibility traces for off-policy reinforcement learning

B Daley, M White, C Amato… - … on Machine Learning, 2023 - proceedings.mlr.press
Off-policy learning from multistep returns is crucial for sample-efficient reinforcement
learning, but counteracting off-policy bias without exacerbating variance is challenging …
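
The standard baseline that such trajectory-aware traces refine is per-decision importance weighting with truncated ratios, as in Retrace($\lambda$), where $c_s = \lambda \min(1, \pi(a_s|x_s)/\mu(a_s|x_s))$. A sketch of that classic correction, not the paper's trajectory-aware variant:

```python
import numpy as np

def retrace_correction(td_errors, pi_probs, mu_probs, gamma, lam):
    """Off-policy-corrected multistep error for time 0:
    sum_t gamma^t (prod_{s=1}^{t} c_s) * delta_t,
    with c_s = lam * min(1, pi/mu) (truncated per-decision ratios)."""
    c = lam * np.minimum(1.0, pi_probs / mu_probs)
    correction, trace = 0.0, 1.0
    for t, delta in enumerate(td_errors):
        if t > 0:
            trace *= c[t]  # the product over c_s starts at s = 1
        correction += (gamma ** t) * trace * delta
    return correction
```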

The nature of temporal difference errors in multi-step distributional reinforcement learning

Y Tang, R Munos, M Rowland… - Advances in …, 2022 - proceedings.neurips.cc
We study the multi-step off-policy learning approach to distributional RL. Despite the
apparent similarity between value-based RL and distributional RL, our study reveals …
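
For context, the $n$-step distributional Bellman backup bootstraps the entire return distribution rather than its mean; in standard notation (my transcription, not the paper's exact formulation):

```latex
% n-step distributional target: equality in distribution, with Z the
% return-distribution function and gamma the discount factor.
Z(x_t, a_t) \stackrel{D}{=} \sum_{k=0}^{n-1} \gamma^k R_{t+k}
  \;+\; \gamma^n Z\bigl(x_{t+n}, a_{t+n}\bigr)
```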

Demystifying the Recency Heuristic in Temporal-Difference Learning

B Daley, MC Machado, M White - arXiv preprint arXiv:2406.12284, 2024 - arxiv.org
The recency heuristic in reinforcement learning is the assumption that stimuli that occurred
closer in time to an acquired reward should be more heavily reinforced. The recency …
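
The recency heuristic is precisely what TD($\lambda$)'s eligibility traces implement: credit for each TD error decays geometrically in the time since a state was visited. A standard tabular TD($\lambda$) sketch making the decay explicit (variable names are mine):

```python
import numpy as np

def td_lambda_update(V, states, rewards, gamma, lam, alpha):
    """One-episode tabular TD(lambda) with accumulating traces.
    e[s] decays by gamma*lam each step, so states visited longer
    ago (farther from the reward) receive geometrically less credit."""
    e = np.zeros_like(V)
    for t in range(len(rewards)):
        s, s_next = states[t], states[t + 1]
        delta = rewards[t] + gamma * V[s_next] - V[s]
        e *= gamma * lam       # recency: decay all existing traces
        e[s] += 1.0            # refresh the trace of the current state
        V += alpha * delta * e
    return V
```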

Variational oracle guiding for reinforcement learning

D Han, T Kozuno, X Luo, ZY Chen, K Doya… - International …, 2022 - openreview.net
How to make intelligent decisions is a central problem in machine learning and artificial
intelligence. Despite recent successes of deep reinforcement learning (RL) in various …

Explaining off-policy actor-critic from a bias-variance perspective

TH Fan, PJ Ramadge - arXiv preprint arXiv:2110.02421, 2021 - arxiv.org
Off-policy Actor-Critic algorithms have demonstrated phenomenal experimental performance
but still require better explanations. To this end, we show that their policy evaluation error on the …

DoMo-AC: doubly multi-step off-policy actor-critic algorithm

Y Tang, T Kozuno, M Rowland… - International …, 2023 - proceedings.mlr.press
Multi-step learning applies lookahead over multiple time steps and has proved valuable in
policy evaluation settings. However, in the optimal control case, the impact of multi-step …