OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation

J Lee, W Jeon, BJ Lee, J Pineau, KE Kim - International Conference on Machine Learning, 2021 - proceedings.mlr.press
We consider the offline reinforcement learning (RL) setting, where the agent aims to optimize
the policy solely from the data, without further environment interactions. In offline RL,
distributional shift, which arises when the target policy being optimized deviates from the
behavior policy used for data collection, becomes the primary source of difficulty. This shift
typically causes overestimation of action values, which poses severe problems for model-
free algorithms that use bootstrapping. To mitigate the problem, prior offline RL algorithms …
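The overestimation mechanism the abstract alludes to can be seen in a few lines: when a bootstrapped target takes a max over noisy action-value estimates, zero-mean estimation error turns into a positive bias, and the bias is largest for actions the behavior policy rarely took. A minimal numpy sketch of this effect (the action count and noise scales are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: every action's true value is 0, but the offline Q estimates
# carry zero-mean noise -- larger for actions the behavior policy rarely took.
n_actions = 10
n_trials = 10_000
noise_scale = np.linspace(0.1, 1.0, n_actions)  # rarely-seen actions -> noisier estimates

targets = []
for _ in range(n_trials):
    q_hat = rng.normal(0.0, noise_scale)  # noisy estimates of the true value 0
    targets.append(q_hat.max())           # bootstrapped target uses max_a Q(s', a)

print("true value of the best action: 0.0")
print(f"mean bootstrapped target: {np.mean(targets):.3f}")  # clearly > 0
```

The mean target comes out well above the true value of 0, and bootstrapping then propagates this error through subsequent updates. As the title indicates, OptiDICE avoids this by estimating stationary distribution correction ratios between the target policy's state-action distribution and the dataset's, rather than relying on bootstrapped action values.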
