A survey on policy search algorithms for learning robot controllers in a handful of trials

K Chatzilygeroudis, V Vassiliades… - IEEE Transactions …, 2019 - ieeexplore.ieee.org
Most policy search (PS) algorithms require thousands of training episodes to find an
effective policy, which is often infeasible with a physical robot. This survey article focuses on …

Maximum a posteriori policy optimisation

A Abdolmaleki, JT Springenberg, Y Tassa… - arXiv preprint arXiv …, 2018 - arxiv.org
We introduce a new algorithm for reinforcement learning called Maximum a posteriori Policy
Optimisation (MPO) based on coordinate ascent on a relative entropy objective. We show …

Softmax deep double deterministic policy gradients

L Pan, Q Cai, L Huang - Advances in neural information …, 2020 - proceedings.neurips.cc
A widely-used actor-critic reinforcement learning algorithm for continuous control, Deep
Deterministic Policy Gradients (DDPG), suffers from the overestimation problem, which can …

Actor-critic policy optimization in partially observable multiagent environments

S Srinivasan, M Lanctot, V Zambaldi… - Advances in neural …, 2018 - proceedings.neurips.cc
Optimization of parameterized policies for reinforcement learning (RL) is an important and
challenging problem in artificial intelligence. Among the most common approaches are …

Deep active inference as variational policy gradients

B Millidge - Journal of Mathematical Psychology, 2020 - Elsevier
Active Inference is a theory arising from theoretical neuroscience which casts action and
planning as Bayesian inference problems to be solved by minimizing a single quantity—the …

Credit assignment for collective multiagent RL with global rewards

DT Nguyen, A Kumar, HC Lau - Advances in neural …, 2018 - proceedings.neurips.cc
Scaling decision theoretic planning to large multiagent systems is challenging due to
uncertainty and partial observability in the environment. We focus on a multiagent planning …

Deep multi-agent reinforcement learning for decentralized continuous cooperative control

CS de Witt, B Peng, PA Kamienny, P Torr… - arXiv preprint arXiv …, 2020 - beipeng.github.io
Deep multi-agent reinforcement learning (MARL) holds the promise of automating many real-
world cooperative robotic manipulation and transportation tasks. Nevertheless …

Importance sampling techniques for policy optimization

AM Metelli, M Papini, N Montali, M Restelli - Journal of Machine Learning …, 2020 - jmlr.org
How can we effectively exploit the collected samples when solving a continuous control task
with Reinforcement Learning? Recent results have empirically demonstrated that multiple …

Proximal policy optimization with policy feedback

Y Gu, Y Cheng, CLP Chen… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Proximal policy optimization (PPO) is a deep reinforcement learning algorithm based on the
actor–critic (AC) architecture. In the classic AC architecture, the Critic (value) network is used …

DAC: The double actor-critic architecture for learning options

S Zhang, S Whiteson - Advances in Neural Information …, 2019 - proceedings.neurips.cc
We reformulate the option framework as two parallel augmented MDPs. Under this novel
formulation, all policy optimization algorithms can be used off the shelf to learn intra-option …