We consider the offline reinforcement learning (RL) setting where the agent aims to optimize
the policy solely from the data without further environment interactions. In offline RL, the
distributional shift becomes the primary source of difficulty: it arises when the target policy being optimized deviates from the behavior policy used for data collection. This deviation typically causes overestimation of action values, which poses severe problems for model-free algorithms that use bootstrapping, as sketched below. To mitigate the problem, prior offline RL algorithms …