High-confidence off-policy evaluation

P Thomas, G Theocharous… - Proceedings of the AAAI …, 2015 - ojs.aaai.org
Many reinforcement learning algorithms use trajectories collected from the execution of one
or more policies to propose a new policy. Because execution of a bad policy can be costly or …
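The snippet breaks off, but the estimator this line of work builds on is standard per-trajectory importance sampling: reweight each logged return by the likelihood ratio between the evaluation and behavior policies. A minimal sketch, assuming trajectories are stored as (state, action, reward) triples and that both policies expose action probabilities; the function names here are illustrative, not from the paper:

```python
def is_estimate(trajectories, pi_e, pi_b, gamma=1.0):
    """Per-trajectory importance-sampling estimate of the target policy's value.

    trajectories: list of episodes, each a list of (state, action, reward).
    pi_e(s, a) and pi_b(s, a): action probabilities under the evaluation and
    behavior policies (pi_b must be nonzero wherever pi_e is).
    """
    estimates = []
    for episode in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(episode):
            weight *= pi_e(s, a) / pi_b(s, a)  # cumulative likelihood ratio
            ret += (gamma ** t) * r            # discounted return
        estimates.append(weight * ret)         # unbiased if pi_b covers pi_e
    return sum(estimates) / len(estimates)
```

High-confidence OPE pairs estimates like these with a concentration inequality, turning the sample mean into a probabilistic lower bound on the new policy's value before deployment.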

Interpretable off-policy evaluation in reinforcement learning by highlighting influential transitions

O Gottesman, J Futoma, Y Liu… - International …, 2020 - proceedings.mlr.press
Off-policy evaluation in reinforcement learning offers the chance to use observational data
to improve future outcomes in domains such as healthcare and education, but safe …

A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning

Y Zhang, J Liu, C Li, Y Niu, Y Yang, Y Liu… - Proceedings of the …, 2024 - ojs.aaai.org
Offline-to-online Reinforcement Learning (O2O RL) aims to improve the performance of an
offline-pretrained policy using only a few online samples. Built on offline RL algorithms, most …

[CITATION][C] Near optimal provable uniform convergence in off-policy evaluation for reinforcement learning

M Yin, Y Bai, YX Wang - arXiv preprint arXiv:2007.03760, 2020

Towards robust off-policy learning for runtime uncertainty

D Xu, Y Ye, C Ruan, B Yang - Proceedings of the AAAI Conference on …, 2022 - ojs.aaai.org
Off-policy learning plays a pivotal role in optimizing and evaluating policies prior to
online deployment. However, during real-time serving, we observe a variety of …

Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design

S Liu, S Zhang - Forty-first International Conference on Machine … - openreview.net
Most reinforcement learning practitioners evaluate their policies with online Monte Carlo
estimators for either hyperparameter tuning or testing different algorithmic design choices …
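For reference, the online Monte Carlo estimator the snippet mentions simply averages discounted returns over on-policy rollouts. A minimal sketch, assuming a Gymnasium-style env with reset()/step(); this interface is an assumption for illustration:

```python
def monte_carlo_value(env, policy, n_episodes=100, gamma=0.99):
    """Estimate a policy's value by averaging discounted returns
    over on-policy rollouts."""
    returns = []
    for _ in range(n_episodes):
        state, _ = env.reset()
        done, ret, discount = False, 0.0, 1.0
        while not done:
            action = policy(state)
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            ret += discount * reward
            discount *= gamma
        returns.append(ret)
    return sum(returns) / len(returns)
```

The title suggests the paper instead designs the behavior policy from offline data, aiming to make this kind of estimate more sample-efficient than rolling out the target policy directly.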

Forward and backward state abstractions for off-policy evaluation

M Hao, P Su, L Hu, Z Szabo, Q Zhao, C Shi - arXiv preprint arXiv …, 2024 - arxiv.org
Off-policy evaluation (OPE) is crucial for evaluating a target policy's impact offline before its
deployment. However, achieving accurate OPE in large state spaces remains challenging …

Probabilistic Offline Policy Ranking with Approximate Bayesian Computation

L Da, P Jenkins, T Schwantes, J Dotson… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
In practice, it is essential to compare and rank candidate policies offline before real-world
deployment for safety and reliability. Prior work seeks to solve this offline policy ranking …

Offline-Boosted Actor-Critic: Adaptively Blending Optimal Historical Behaviors in Deep Off-Policy RL

Y Luo, T Ji, F Sun, J Zhang, H Xu, X Zhan - arXiv preprint arXiv …, 2024 - arxiv.org
Off-policy reinforcement learning (RL) has achieved notable success in tackling many
complex real-world tasks by leveraging previously collected data for policy learning …

Gradient temporal-difference learning for off-policy evaluation using emphatic weightings

J Cao, Q Liu, F Zhu, Q Fu, S Zhong - Information Sciences, 2021 - Elsevier
The problem of off-policy evaluation (OPE) has long been regarded as one of the foremost
challenges in reinforcement learning. Gradient-based and emphasis-based temporal …
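The snippet is cut off, but the emphatic weighting it refers to follows Sutton, Mahmood and White (2016). A minimal sketch of linear emphatic TD(λ) for off-policy evaluation, assuming a fixed interest of 1, features phi(s), and per-step importance ratios rho = pi_e(a|s)/pi_b(a|s) supplied with the data; names are illustrative, and episode boundaries are omitted for brevity:

```python
import numpy as np

def emphatic_td(transitions, phi, dim, alpha=0.01, gamma=0.99, lam=0.0):
    """Linear emphatic TD(lambda) for off-policy evaluation.

    transitions: iterable of (s, rho, r, s_next), where rho is the
    per-step importance ratio pi_e(a|s) / pi_b(a|s).
    phi(s): feature vector of length dim; interest i(s) is fixed to 1.
    """
    w = np.zeros(dim)           # value-function weights
    e = np.zeros(dim)           # eligibility trace
    F, rho_prev = 0.0, 1.0      # followon trace, previous step's ratio
    for s, rho, r, s_next in transitions:
        F = rho_prev * gamma * F + 1.0            # followon trace (interest = 1)
        M = lam + (1.0 - lam) * F                 # emphasis
        x, x_next = phi(s), phi(s_next)
        e = rho * (gamma * lam * e + M * x)       # emphatically weighted trace
        delta = r + gamma * (w @ x_next) - w @ x  # TD error
        w = w + alpha * delta * e
        rho_prev = rho
    return w
```

The emphasis M reweights each update by how strongly the target policy would have reached that state, which restores the stability guarantees that plain off-policy TD with function approximation lacks.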