Towards continual reinforcement learning: A review and perspectives

K Khetarpal, M Riemer, I Rish, D Precup - Journal of Artificial Intelligence …, 2022 - jair.org
In this article, we aim to provide a literature review of different formulations and approaches
to continual reinforcement learning (RL), also known as lifelong or non-stationary RL. We …

Anytime-valid off-policy inference for contextual bandits

I Waudby-Smith, L Wu, A Ramdas… - ACM/JMS Journal of …, 2024 - dl.acm.org
Contextual bandit algorithms are ubiquitous tools for active sequential experimentation in
healthcare and the tech industry. They involve online learning algorithms that adaptively …

[BOOK][B] Distributional reinforcement learning

MG Bellemare, W Dabney, M Rowland - 2023 - books.google.com
The first comprehensive guide to distributional reinforcement learning, providing a new
mathematical formalism for thinking about decisions from a probabilistic perspective …

Proximal reinforcement learning: Efficient off-policy evaluation in partially observed Markov decision processes

A Bennett, N Kallus - Operations Research, 2024 - pubsonline.informs.org
In applications of offline reinforcement learning to observational data, such as in healthcare
or education, a general concern is that observed actions might be affected by unobserved …

Distributional offline policy evaluation with predictive error guarantees

R Wu, M Uehara, W Sun - International Conference on …, 2023 - proceedings.mlr.press
We study the problem of estimating the distribution of the return of a policy using an offline
dataset that is not generated from the policy, i.e., distributional offline policy evaluation (OPE) …
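To make the problem statement concrete, here is a minimal sketch of one simple way to estimate a return distribution from off-policy data: an importance-weighted empirical CDF. It assumes a one-step (contextual bandit) setting with known logging probabilities. It is a swapped-in textbook baseline, not the estimator proposed in the cited paper, and the function name `weighted_return_cdf` is hypothetical.

```python
import numpy as np

def weighted_return_cdf(returns, behavior_probs, target_probs, grid):
    """Importance-weighted empirical CDF F(t) = P(return <= t) under a target policy.

    Hypothetical helper; assumes a one-step setting where
      returns[i]        is the observed return of logged sample i,
      behavior_probs[i] is the logging policy's probability of the logged action,
      target_probs[i]   is the target policy's probability of that same action.
    """
    returns = np.asarray(returns, dtype=float)
    # Importance ratio pi(a|x) / mu(a|x) for each logged sample.
    w = np.asarray(target_probs, dtype=float) / np.asarray(behavior_probs, dtype=float)
    # Self-normalize so the estimate is a valid CDF (ends at 1).
    w = w / w.sum()
    grid = np.asarray(grid, dtype=float)
    # For each grid point t, add up the weights of returns that are <= t.
    return np.array([w[returns <= t].sum() for t in grid])
```

For example, `weighted_return_cdf([0, 1, 1, 0], [0.5] * 4, [1, 0.5, 1, 0.5], [-1, 0, 0.5, 1])` reweights the four logged returns toward the actions the target policy favors. Self-normalization trades a small bias for lower variance and a properly normalized distribution.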

Conformal off-policy prediction in contextual bandits

MF Taufiq, JF Ton, R Cornish… - Advances in Neural …, 2022 - proceedings.neurips.cc
Most off-policy evaluation methods for contextual bandits have focused on the expected
outcome of a policy, which is estimated via methods that at best provide only asymptotic …
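For contrast with the snippet above, here is a minimal sketch of the kind of expected-outcome estimator it refers to: inverse propensity scoring (IPS) with a normal-approximation interval, whose coverage holds only asymptotically. This is a standard baseline, not the conformal method of the cited paper. It assumes known logging probabilities, and the function name `ips_estimate` is hypothetical.

```python
import numpy as np

def ips_estimate(rewards, behavior_probs, target_probs, z=1.96):
    """IPS estimate of a target policy's expected reward, plus a CLT-based interval.

    Hypothetical helper; assumes logged (reward, logging prob, target prob) triples.
    The interval relies on a normal approximation, so its coverage is only asymptotic.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Importance ratio pi(a|x) / mu(a|x) for each logged sample.
    w = np.asarray(target_probs, dtype=float) / np.asarray(behavior_probs, dtype=float)
    terms = w * rewards
    n = terms.size
    est = float(terms.mean())
    # Half-width z * (sample std / sqrt(n)) from the central limit theorem.
    half = z * float(terms.std(ddof=1)) / np.sqrt(n)
    return est, (est - half, est + half)
```

With `ips_estimate([1, 0, 1, 0], [0.5] * 4, [1, 0, 1, 0])`, the point estimate is 1.0. The interval around it comes only from the normal approximation and says nothing about the spread of individual outcomes, which is the gap the snippet points to.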

AllSim: Simulating and benchmarking resource allocation policies in multi-user systems

J Berrevoets, D Jarrett, A Chan… - Advances in Neural …, 2024 - proceedings.neurips.cc
Numerous real-world systems, ranging from healthcare to energy grids, involve users
competing for finite and potentially scarce resources. Designing policies for resource …

HOPE: Human-centric off-policy evaluation for e-learning and healthcare

G Gao, S Ju, MS Ausin, M Chi - arXiv preprint arXiv:2302.09212, 2023 - arxiv.org
Reinforcement learning (RL) has been extensively researched for enhancing human-
environment interactions in various human-centric tasks, including e-learning and …

Opportunities and challenges from using animal videos in reinforcement learning for navigation

V Giammarino, J Queeney, LC Carstensen… - IFAC-PapersOnLine, 2023 - Elsevier
We investigate the use of animal videos (observations) to improve Reinforcement Learning
(RL) efficiency and performance in navigation tasks with sparse rewards. Motivated by …

Off-policy evaluation for action-dependent non-stationary environments

Y Chandak, S Shankar, N Bastian… - Advances in …, 2022 - proceedings.neurips.cc
Methods for sequential decision-making are often built upon a foundational assumption that
the underlying decision process is stationary. This limits the application of such methods …