X Ma, X Tang, L Xia, J Yang, Q Zhao - arXiv preprint arXiv:2106.03442, 2021 - arxiv.org
Most reinforcement learning algorithms optimize the discounted criterion, which helps accelerate convergence and reduce the variance of estimates. Although the …
Novel advanced policy gradient (APG) algorithms, such as proximal policy optimization (PPO), trust region policy optimization, and their variations, have become the dominant …