Is pessimism provably efficient for offline RL?

Y Jin, Z Yang, Z Wang - International Conference on …, 2021 - proceedings.mlr.press
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on
a dataset collected a priori. Due to the lack of further interactions with the environment …

Provable benefits of actor-critic methods for offline reinforcement learning

A Zanette, MJ Wainwright… - Advances in neural …, 2021 - proceedings.neurips.cc
Actor-critic methods are widely used in offline reinforcement learning practice, but are not so
well-understood theoretically. We propose a new offline actor-critic algorithm that naturally …

Statistical inference of the value function for reinforcement learning in infinite-horizon settings

C Shi, S Zhang, W Lu, R Song - Journal of the Royal Statistical …, 2022 - academic.oup.com
Reinforcement learning is a general technique that allows an agent to learn an optimal
policy and interact with an environment in sequential decision-making problems. The …

When is realizability sufficient for off-policy reinforcement learning?

A Zanette - International Conference on Machine Learning, 2023 - proceedings.mlr.press
Understanding when reinforcement learning algorithms can make successful off-policy
predictions, and when they may fail to do so, remains an open problem. Typically, model …

Off-policy confidence interval estimation with confounded Markov decision process

C Shi, J Zhu, Y Shen, S Luo, H Zhu… - Journal of the American …, 2024 - Taylor & Francis
This article is concerned with constructing a confidence interval for a target policy's value
offline based on pre-collected observational data in infinite horizon settings. Most of the …

Instabilities of offline RL with pre-trained neural representation

R Wang, Y Wu, R Salakhutdinov… - … on Machine Learning, 2021 - proceedings.mlr.press
In offline reinforcement learning (RL), we seek to utilize offline data to evaluate (or learn)
policies in scenarios where the data are collected from a distribution that substantially differs …

Nearly horizon-free offline reinforcement learning

T Ren, J Li, B Dai, SS Du… - Advances in neural …, 2021 - proceedings.neurips.cc
We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision
Processes (MDPs). For tabular MDPs with $S$ states and $A$ actions, or linear MDPs with …

Post-contextual-bandit inference

A Bibaut, M Dimakopoulou, N Kallus… - Advances in neural …, 2021 - proceedings.neurips.cc
Contextual bandit algorithms are increasingly replacing non-adaptive A/B tests in
e-commerce, healthcare, and policymaking because they can both improve outcomes for …

Deeply-debiased off-policy interval estimation

C Shi, R Wan, V Chernozhukov… - … Conference on Machine …, 2021 - proceedings.mlr.press
Off-policy evaluation learns a target policy's value with a historical dataset generated by a
different behavior policy. In addition to a point estimate, many applications would benefit …

Off-policy evaluation in infinite-horizon reinforcement learning with latent confounders

A Bennett, N Kallus, L Li… - … Conference on Artificial …, 2021 - proceedings.mlr.press
Off-policy evaluation (OPE) in reinforcement learning is an important problem in settings
where experimentation is limited, such as healthcare. But, in these very same settings …