On the sample complexity of actor-critic method for reinforcement learning with function approximation

H Kumar, A Koppel, A Ribeiro - Machine Learning, 2023 - Springer
Reinforcement learning, mathematically described by Markov Decision Problems, may be
approached either through dynamic programming or policy search. Actor-critic algorithms …
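As an illustration of the actor-critic template this line of work analyzes, here is a minimal sketch in Python: a softmax actor and a TD(0) critic interacting on an invented toy MDP. The MDP, the one-hot (tabular) feature map, and the step sizes are assumptions made for the example, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP: nS states, nA actions, random dynamics and rewards.
nS, nA, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a] = next-state distribution
R = rng.random((nS, nA))                       # R[s, a] = expected reward

theta = np.zeros((nS, nA))   # actor: softmax policy parameters
w = np.zeros(nS)             # critic: value weights over one-hot state features
alpha_w, alpha_theta = 0.1, 0.01

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

s = 0
for t in range(20000):
    pi = policy(s)
    a = rng.choice(nA, p=pi)
    s_next = rng.choice(nS, p=P[s, a])
    r = R[s, a]
    # Critic: TD(0) update of the state-value estimate.
    delta = r + gamma * w[s_next] - w[s]
    w[s] += alpha_w * delta
    # Actor: policy-gradient step using the TD error as the advantage signal.
    grad_log = -pi                   # grad of log softmax w.r.t. theta[s]
    grad_log[a] += 1.0
    theta[s] += alpha_theta * delta * grad_log
    s = s_next
```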

A novel framework for policy mirror descent with general parameterization and linear convergence

C Alfano, R Yuan, P Rebeschini - Advances in Neural …, 2023 - proceedings.neurips.cc
Modern policy optimization methods in reinforcement learning, such as TRPO and PPO, owe
their success to the use of parameterized policies. However, while theoretical guarantees …
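For reference, the policy mirror descent step with a KL-divergence mirror map has a closed form in the exact softmax/tabular case: pi_{k+1}(a|s) is proportional to pi_k(a|s) * exp(eta * Q^{pi_k}(s, a)). A minimal sketch of that idealized update, assuming exact Q-values and a tabular policy (the paper's framework covers general parameterizations; this shows only the exact-case step):

```python
import numpy as np

def pmd_step(pi, Q, eta):
    """One tabular policy-mirror-descent step with a KL regularizer:
    pi_new(a|s) proportional to pi(a|s) * exp(eta * Q(s, a))."""
    logits = np.log(pi) + eta * Q
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum(axis=1, keepdims=True)

# Illustrative shapes: 4 states, 3 actions, invented Q-values.
rng = np.random.default_rng(1)
pi = np.full((4, 3), 1 / 3)
Q = rng.random((4, 3))
pi = pmd_step(pi, Q, eta=0.5)
```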

Convergence of actor-critic with multi-layer neural networks

H Tian, A Olshevsky… - Advances in Neural …, 2024 - proceedings.neurips.cc
The early theory of actor-critic methods considered convergence using linear function
approximators for the policy and value functions. Recent work has established convergence …

Partially Observed Optimal Stochastic Control: Regularity, Optimality, Approximations, and Learning

AD Kara, S Yuksel - arXiv preprint arXiv:2412.06735, 2024 - arxiv.org
In this review/tutorial article, we present recent progress on optimal control of partially
observed Markov Decision Processes (POMDPs). We first present regularity and continuity …

Closing the gap: Achieving global convergence (last iterate) of actor-critic under Markovian sampling with neural network parametrization

M Gaur, AS Bedi, D Wang, V Aggarwal - arXiv preprint arXiv:2405.01843, 2024 - arxiv.org
The current state-of-the-art theoretical analysis of Actor-Critic (AC) algorithms significantly
lags in addressing the practical aspects of AC implementations. This crucial gap needs …

Decision-aware actor-critic with function approximation and theoretical guarantees

S Vaswani, A Kazemi… - Advances in …, 2024 - proceedings.neurips.cc
Actor-critic (AC) methods are widely used in reinforcement learning (RL), and benefit from
the flexibility of using any policy gradient method as the actor and value-based method as …

Sample complexity and overparameterization bounds for temporal-difference learning with neural network approximation

S Cayci, S Satpathi, N He… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
In this article, we study the dynamics of temporal-difference (TD) learning with neural
network-based value function approximation over a general state space, namely, neural TD …
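A minimal sketch of neural TD in the wide-network (lazy-training) regime that such overparameterization analyses often consider: a random, fixed ReLU hidden layer with only the output weights trained by semi-gradient TD(0). The toy Markov chain, reward, network width, and step size are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Markov reward process with continuous states in R^d.
gamma, d, m = 0.9, 4, 256                   # discount, state dim, network width
W = rng.normal(size=(m, d)) / np.sqrt(d)    # hidden layer: random, kept fixed
theta = np.zeros(m)                         # output layer: the trained weights

def features(s):
    return np.maximum(W @ s, 0.0) / np.sqrt(m)   # ReLU random features

def V(s):
    return features(s) @ theta

s = rng.normal(size=d)
alpha = 0.1
for t in range(5000):
    s_next = 0.5 * s + rng.normal(scale=0.1, size=d)  # toy linear-Gaussian chain
    r = float(s[0])                                   # toy reward
    delta = r + gamma * V(s_next) - V(s)              # TD error
    theta += alpha * delta * features(s)              # semi-gradient TD(0) step
    s = s_next
```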

Non-Asymptotic Analysis for Single-Loop (Natural) Actor-Critic with Compatible Function Approximation

Y Wang, Y Wang, Y Zhou, S Zou - arXiv preprint arXiv:2406.01762, 2024 - arxiv.org
Actor-critic (AC) is a powerful method for learning an optimal policy in reinforcement
learning, where the critic uses algorithms, e.g., temporal difference (TD) learning with function …
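For context, "compatible function approximation" classically means the critic is linear in the policy's score function, Q_w(s, a) = w^T grad_theta log pi_theta(a|s), which makes the fitted critic yield an unbiased policy gradient. A minimal sketch for a softmax policy; the tabular parameterization and shapes are assumptions for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def compatible_features(theta, s, a):
    """grad_theta log pi_theta(a|s) for a softmax-over-actions policy with
    one parameter per (state, action); the compatible critic is then
    Q_w(s, a) = w @ compatible_features(theta, s, a)."""
    pi = softmax(theta[s])
    g = np.zeros_like(theta)
    g[s] = -pi
    g[s, a] += 1.0
    return g.ravel()

nS, nA = 4, 3
theta = np.zeros((nS, nA))
w = np.zeros(nS * nA)
phi = compatible_features(theta, s=1, a=2)
Q_est = w @ phi   # critic value under compatible function approximation
```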

Sample Complexity of Neural Policy Mirror Descent for Policy Optimization on Low-Dimensional Manifolds

Z Xu, X Ji, M Chen, M Wang, T Zhao - Journal of Machine Learning …, 2024 - jmlr.org
Policy gradient methods equipped with deep neural networks have achieved great success
in solving high-dimensional reinforcement learning (RL) problems. However, current …

Recurrent Natural Policy Gradient for POMDPs

S Cayci, A Eryilmaz - arXiv preprint arXiv:2405.18221, 2024 - arxiv.org
In this paper, we study a natural policy gradient method based on recurrent neural networks
(RNNs) for partially-observable Markov decision processes, whereby RNNs are used for …
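A minimal sketch of the architecture: an RNN whose hidden state summarizes the observation history (a learned stand-in for the belief state) and feeds a softmax action distribution. Only the forward pass is shown; training would backpropagate a (natural) policy gradient through the unrolled recurrence. All weights and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: observation dim, hidden dim, number of actions.
d_obs, d_h, nA = 3, 8, 2
Wh = rng.normal(scale=0.1, size=(d_h, d_h))   # hidden-to-hidden weights
Wo = rng.normal(scale=0.1, size=(d_h, d_obs)) # observation-to-hidden weights
Wa = rng.normal(scale=0.1, size=(nA, d_h))    # hidden-to-action-logit weights

def rnn_policy_step(h, obs):
    """One recurrent step: the hidden state h compresses the observation
    history and plays the role of a belief state for action selection."""
    h = np.tanh(Wh @ h + Wo @ obs)
    logits = Wa @ h
    p = np.exp(logits - logits.max())
    return h, p / p.sum()

h = np.zeros(d_h)
for t in range(5):
    obs = rng.normal(size=d_obs)      # stand-in for an environment observation
    h, pi = rnn_policy_step(h, obs)
    a = rng.choice(nA, p=pi)          # sample an action from the RNN policy
```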