We identify and study the phenomenon of policy churn, that is, the rapid change of the greedy policy in value-based reinforcement learning. Policy churn operates at a surprisingly …
B Daley, M White, MC Machado - Forty-first International …, 2024 - openreview.net
Multistep returns, such as $ n $-step returns and $\lambda $-returns, are commonly used to improve the sample efficiency of reinforcement learning (RL) methods. The variance of the …
We investigate the use of animal videos (observations) to improve Reinforcement Learning (RL) efficiency and performance in navigation tasks with sparse rewards. Motivated by …
Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, but counteracting off-policy bias without exacerbating variance is challenging …
We study the multi-step off-policy learning approach to distributional RL. Despite the apparent similarity between value-based RL and distributional RL, our study reveals …
B Daley, MC Machado, M White - arXiv preprint arXiv:2406.12284, 2024 - arxiv.org
The recency heuristic in reinforcement learning is the assumption that stimuli that occurred closer in time to an acquired reward should be more heavily reinforced. The recency …
How to make intelligent decisions is a central problem in machine learning and artificial intelligence. Despite recent successes of deep reinforcement learning (RL) in various …
Off-policy Actor-Critic algorithms have demonstrated phenomenal experimental performance but still require better explanations. To this end, we show its policy evaluation error on the …
Multi-step learning applies lookahead over multiple time steps and has proved valuable in policy evaluation settings. However, in the optimal control case, the impact of multi-step …