Authors
Qingyuan Wu, Yuhui Wang
Publication date
2022
Abstract
Ensemble reinforcement learning (RL) methods are widely used to improve learning efficiency by maintaining multiple models. Beyond the improved model accuracy inherited from supervised learning, exploration diversity can also be enhanced in the RL setting, scaling linearly with the number of policies. However, this linearly-enhanced exploration diversity falls far short of addressing the curse of dimensionality, i.e., the fact that the number of states grows exponentially with the dimension. Moreover, since high rewards are often captured by only some of the models and missed by the others, the integrated ensemble model can suffer from a risk of underestimation. In this work, we propose an alternative ensemble RL method to mitigate the above issues. The new method, named TEAM Q-learning, integrates two key ingredients: (1) a new temporally-varying strategy, which enhances exploration diversity by re-selecting a candidate policy at each time step, and (2) a new expected-max Q-ensemble operator, which mitigates the risk of underestimation by integrating the optimal estimates as the learning target. In addition, we theoretically show that the exploration diversity of the new strategy increases exponentially with the ensemble size, and that our operator generally converges faster than the classical ensemble operator. Moreover, we propose two variants, named TEAM DQN and DP-TEAM DQN, for deep and distributed RL, respectively. We empirically show that our methods are superior to other popular ensemble RL methods in tabular and Atari benchmark environments.
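The abstract only outlines the two ingredients, so the following minimal tabular sketch is one plausible reading rather than the paper's actual algorithm. It assumes the temporally-varying strategy samples an acting member uniformly at each time step, and that the expected-max operator averages each member's own greedy bootstrap value; the class name `EnsembleQAgent`, its hyperparameters, and the state/action indexing are hypothetical illustration choices.

```python
import numpy as np

# Hypothetical tabular sketch; n_states, n_actions, and the environment
# interface are illustrative placeholders, not from the paper.
class EnsembleQAgent:
    def __init__(self, n_states, n_actions, n_models=5,
                 lr=0.1, gamma=0.99, epsilon=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        # One Q-table per ensemble member.
        self.Q = self.rng.normal(0.0, 0.01, size=(n_models, n_states, n_actions))
        self.n_models, self.n_actions = n_models, n_actions
        self.lr, self.gamma, self.epsilon = lr, gamma, epsilon

    def act(self, state):
        # Temporally-varying strategy (assumed form): re-select one candidate
        # policy (ensemble member) at every time step, then act epsilon-greedily.
        k = self.rng.integers(self.n_models)
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.n_actions))
        return int(np.argmax(self.Q[k, state]))

    def update(self, s, a, r, s_next, done):
        # "Expected-max" style target (assumed reading): average each member's
        # own greedy value instead of taking the max of the averaged Q-values.
        # By Jensen's inequality this target is never smaller, consistent with
        # the abstract's claim of mitigating underestimation.
        bootstrap = 0.0 if done else float(np.mean(np.max(self.Q[:, s_next, :], axis=1)))
        target = r + self.gamma * bootstrap
        # Every member is regressed toward the shared ensemble target.
        self.Q[:, s, a] += self.lr * (target - self.Q[:, s, a])
```

The key contrast under this reading is the bootstrap order: the classical ensemble target maximizes over the averaged Q-values, whereas the expected-max target averages each member's own maximum, which can only raise the learning target.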