R Jiang, T Zahavy, Z Xu, A White, M Hessel… - arXiv preprint arXiv …, 2021 - arxiv.org
Off-policy learning allows us to learn about possible policies of behavior from experience
generated by a different behavior policy. Temporal difference (TD) learning algorithms can …