Deeply-debiased off-policy interval estimation

C Shi, R Wan, V Chernozhukov… - … conference on machine …, 2021 - proceedings.mlr.press
Off-policy evaluation learns a target policy's value with a historical dataset generated by a
different behavior policy. In addition to a point estimate, many applications would benefit …

CoinDICE: Off-policy confidence interval estimation

B Dai, O Nachum, Y Chow, L Li… - Advances in neural …, 2020 - proceedings.neurips.cc
We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning,
where the goal is to estimate a confidence interval on a target policy's value, given only …

Variance-aware off-policy evaluation with linear function approximation

Y Min, T Wang, D Zhou, Q Gu - Advances in neural …, 2021 - proceedings.neurips.cc
We study the off-policy evaluation (OPE) problem in reinforcement learning with linear
function approximation, which aims to estimate the value function of a target policy based on …

Multi-step off-policy learning without importance sampling ratios

AR Mahmood, H Yu, RS Sutton - arXiv preprint arXiv:1702.03006, 2017 - arxiv.org
To estimate the value functions of policies from exploratory data, most model-free off-policy
algorithms rely on importance sampling, where the use of importance sampling ratios often …
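
As context for the importance-sampling ratios this entry refers to, here is a minimal sketch of ordinary per-decision importance sampling for off-policy value estimation. The trajectory format and the target_policy / behavior_policy callables are hypothetical placeholders for illustration; this is the ratio-based baseline, not the ratio-free algorithm the paper proposes.

```python
import numpy as np

def per_decision_is_estimate(trajectories, target_policy, behavior_policy, gamma=0.99):
    """Per-decision importance-sampling estimate of the target policy's value.

    trajectories    : list of trajectories, each a list of (state, action, reward) tuples
    target_policy   : callable (state, action) -> action probability under the target policy
    behavior_policy : callable (state, action) -> action probability under the behavior policy
    """
    returns = []
    for traj in trajectories:
        rho, value = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            # cumulative product of per-step ratios pi_target / pi_behavior
            rho *= target_policy(s, a) / behavior_policy(s, a)
            value += (gamma ** t) * rho * r
        returns.append(value)
    return float(np.mean(returns))
```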

Accountable off-policy evaluation with kernel Bellman statistics

Y Feng, T Ren, Z Tang, Q Liu - International Conference on …, 2020 - proceedings.mlr.press
We consider off-policy evaluation (OPE), which evaluates the performance of a new policy
from observed data collected from previous experiments, without requiring the execution of …

Adaptive estimator selection for off-policy evaluation

Y Su, P Srinath… - … Conference on Machine …, 2020 - proceedings.mlr.press
We develop a generic data-driven method for estimator selection in off-policy policy
evaluation settings. We establish a strong performance guarantee for the method, showing …

An instrumental variable approach to confounded off-policy evaluation

Y Xu, J Zhu, C Shi, S Luo… - … Conference on Machine …, 2023 - proceedings.mlr.press
Off-policy evaluation (OPE) aims to estimate the return of a target policy using some pre-
collected observational data generated by a potentially different behavior policy. In many …

Off-policy confidence interval estimation with confounded Markov decision process

C Shi, J Zhu, Y Shen, S Luo, H Zhu… - Journal of the American …, 2024 - Taylor & Francis
This article is concerned with constructing a confidence interval for a target policy's value
offline based on pre-collected observational data in infinite-horizon settings. Most of the …

Understanding the curse of horizon in off-policy evaluation via conditional importance sampling

Y Liu, PL Bacon, E Brunskill - International Conference on …, 2020 - proceedings.mlr.press
Off-policy policy estimators that use importance sampling (IS) can suffer from high variance
in long-horizon domains, and there has been particular excitement over new IS methods that …
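
The high variance this entry refers to arises because the trajectory-level importance weight is a product of per-step ratios, so its variance typically grows exponentially with the horizon (the "curse of horizon"). A tiny simulation under an illustrative assumption (i.i.d. per-step ratios of 1.4 or 0.6, each with probability 1/2, so the mean ratio is exactly 1) makes the growth visible:

```python
import numpy as np

rng = np.random.default_rng(0)

def trajectory_weights(horizon, n_trajectories=100_000):
    # Illustrative assumption: each per-step ratio is 1.4 or 0.6 with equal
    # probability (mean 1); the cumulative weight is their product over the horizon.
    steps = rng.choice([1.4, 0.6], size=(n_trajectories, horizon))
    return steps.prod(axis=1)

for H in (1, 5, 20, 50):
    w = trajectory_weights(H)
    print(f"H={H:3d}  mean={w.mean():.3f}  var={w.var():.2e}")
```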

Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning

N Kallus, M Uehara - arXiv preprint arXiv:1909.05850, 2019 - arxiv.org
Off-policy evaluation (OPE) in reinforcement learning is notoriously difficult in long- and
infinite-horizon settings due to diminishing overlap between behavior and target policies. In …