Cross-Validated Off-Policy Evaluation

M Cief, M Kompan, B Kveton - arXiv preprint arXiv:2405.15332, 2024 - arxiv.org
In this paper, we study the problem of estimator selection and hyper-parameter tuning in off-
policy evaluation. Although cross-validation is the most popular method for model selection …

Pessimistic model selection for offline deep reinforcement learning

CHH Yang, Z Qi, Y Cui… - Uncertainty in Artificial …, 2023 - proceedings.mlr.press
Deep Reinforcement Learning (DRL) has demonstrated great potential in solving
sequential decision-making problems in many applications. Despite its promising …

A Rigorous Risk-aware Linear Approach to Extended Markov Ratio Decision Processes with Embedded Learning

A Zadorojniy, T Osogami, O Davidovich - IJCAI, 2023 - ijcai.org
We consider the problem of risk-aware Markov Decision Processes (MDPs) for Safe AI. We
introduce a theoretical framework, Extended Markov Ratio Decision Processes (EMRDP) …

AutoOPE: Automated Off-Policy Estimator Selection

N Felicioni, M Benigni, MF Dacrema - arXiv preprint arXiv:2406.18022, 2024 - arxiv.org
The Off-Policy Evaluation (OPE) problem consists of evaluating the performance of
counterfactual policies using data collected by a different policy. This problem is of utmost …

Large-scale open dataset, pipeline, and benchmark for bandit algorithms

Y Saito, S Aihara, M Matsutani… - arXiv preprint arXiv …, 2020 - dynamicdecisions.github.io
We build and publicize the Open Bandit Dataset to facilitate scalable and reproducible
research on bandit algorithms. It is especially suitable for off-policy evaluation (OPE), which …

A Fine-grained Analysis of Fitted Q-evaluation: Beyond Parametric Models

J Wang, Z Qi, RKW Wong - arXiv preprint arXiv:2406.10438, 2024 - arxiv.org
In this paper, we delve into the statistical analysis of the fitted Q-evaluation (FQE) method,
which focuses on estimating the value of a target policy using offline data generated by …

Is Separately Modeling Subpopulations Beneficial for Sequential Decision-Making?

I Lee - Operations Research, 2023 - pubsonline.informs.org
In recent applications of Markov decision processes (MDPs), it is common to estimate
transition probabilities and rewards from transition data. In healthcare and some other …

Optimizing Warfarin Dosing Using Contextual Bandit: An Offline Policy Learning and Evaluation Method

Y Huang, CA Downs, AM Rahmani - arXiv preprint arXiv:2402.11123, 2024 - arxiv.org
Warfarin, an anticoagulant medication, is formulated to prevent and address conditions
associated with abnormal blood clotting, making it one of the most prescribed drugs globally …

Leveraging Factored Action Spaces for Off-Policy Evaluation

A Rebello, S Tang, J Wiens, S Parbhoo - arXiv preprint arXiv:2307.07014, 2023 - arxiv.org
Off-policy evaluation (OPE) aims to estimate the benefit of following a counterfactual
sequence of actions, given data collected from executed sequences. However, existing OPE …

Off environment evaluation using convex risk minimization

P Katdare, S Liu, KD Campbell - … International Conference on …, 2022 - ieeexplore.ieee.org
Applying reinforcement learning (RL) methods on robots typically involves training a policy
in simulation and deploying it on a robot in the real world. Because of the model mismatch …