Data-efficient off-policy policy evaluation for reinforcement learning

P Thomas, E Brunskill - International Conference on …, 2016 - proceedings.mlr.press
In this paper we present a new way of predicting the performance of a reinforcement
learning policy given historical data that may have been generated by a different policy. The …
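
The baseline that such off-policy estimators build on is ordinary importance sampling: reweight each logged trajectory's return by the likelihood ratio between the evaluation and behaviour policies. A minimal sketch of that baseline (not this paper's estimator; the function and policy signatures are illustrative assumptions):

```python
def importance_sampling_ope(episodes, pi_e, pi_b, gamma=1.0):
    """Ordinary importance-sampling estimate of pi_e's value from
    trajectories collected under a behaviour policy pi_b.

    episodes: list of trajectories, each a list of (state, action, reward)
    pi_e, pi_b: callables (state, action) -> action probability
    """
    total = 0.0
    for ep in episodes:
        weight, ret, disc = 1.0, 0.0, 1.0
        for s, a, r in ep:
            weight *= pi_e(s, a) / pi_b(s, a)  # likelihood ratio
            ret += disc * r                     # discounted return
            disc *= gamma
        total += weight * ret
    return total / len(episodes)
```

The product of likelihood ratios makes this estimator unbiased but high-variance, which is exactly what motivates the more data-efficient estimators studied in work like this.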

Meta-gradient reinforcement learning

Z Xu, HP van Hasselt, D Silver - Advances in neural …, 2018 - proceedings.neurips.cc
The goal of reinforcement learning algorithms is to estimate and/or optimise the value
function. However, unlike supervised learning, no teacher or oracle is available to provide …

On Monte Carlo tree search and reinforcement learning

T Vodopivec, S Samothrakis, B Ster - Journal of Artificial Intelligence …, 2017 - jair.org
Fuelled by successes in Computer Go, Monte Carlo tree search (MCTS) has achieved widespread
adoption within the games community. Its links to traditional reinforcement learning …

Fast efficient hyperparameter tuning for policy gradient methods

S Paul, V Kurin, S Whiteson - Advances in Neural …, 2019 - proceedings.neurips.cc
The performance of policy gradient methods is sensitive to hyperparameter settings that
must be tuned for any new application. Widely used grid search methods for tuning …

Automated reinforcement learning (AutoRL): A survey and open problems

J Parker-Holder, R Rajan, X Song, A Biedenkapp… - Journal of Artificial …, 2022 - jair.org
The combination of Reinforcement Learning (RL) with deep learning has led to a
series of impressive feats, with many believing (deep) RL provides a path towards generally …

ε-BMC: A Bayesian ensemble approach to epsilon-greedy exploration in model-free reinforcement learning

M Gimelfarb, S Sanner, CG Lee - arXiv preprint arXiv:2007.00869, 2020 - arxiv.org
Resolving the exploration-exploitation trade-off remains a fundamental problem in the
design and implementation of reinforcement learning (RL) algorithms. In this paper, we …
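
The rule this line of work adapts is classic ε-greedy: with probability ε pick a uniformly random action, otherwise pick the greedy one. A minimal sketch with a fixed ε (the paper's contribution is choosing ε online via a Bayesian ensemble; this is only the baseline, and the names are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Standard epsilon-greedy action selection over a list of Q-values:
    explore uniformly with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))          # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit
```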

A greedy approach to adapting the trace parameter for temporal difference learning

M White, A White - arXiv preprint arXiv:1607.00446, 2016 - arxiv.org
One of the main obstacles to broad application of reinforcement learning methods is the
parameter sensitivity of our core learning algorithms. In many large-scale applications …

Fast efficient hyperparameter tuning for policy gradients

S Paul, V Kurin, S Whiteson - arXiv preprint arXiv:1902.06583, 2019 - arxiv.org
The performance of policy gradient methods is sensitive to hyperparameter settings that
must be tuned for any new application. Widely used grid search methods for tuning …

Reinforcement learning with multiple experts: A Bayesian model combination approach

M Gimelfarb, S Sanner, CG Lee - Advances in neural …, 2018 - proceedings.neurips.cc
Potential based reward shaping is a powerful technique for accelerating convergence of
reinforcement learning algorithms. Typically, such information includes an estimate of the …
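
Potential-based shaping adds F(s, s') = γΦ(s') − Φ(s) to the environment reward, which is known to leave the optimal policy unchanged (Ng, Harada & Russell, 1999). A minimal sketch, with the potential function Φ supplied by the caller (an illustrative helper, not this paper's method of combining multiple shaping experts):

```python
def shaped_reward(r, s, s_next, potential, gamma=0.99, terminal=False):
    """Potential-based reward shaping: r + gamma * Phi(s') - Phi(s).

    potential: callable state -> float, the shaping potential Phi.
    The potential of a terminal state is taken to be 0, as required
    for policy invariance in episodic tasks.
    """
    phi_next = 0.0 if terminal else potential(s_next)
    return r + gamma * phi_next - potential(s)
```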

On the Rate of Convergence and Error Bounds for LSTD(λ)

M Tagorti, B Scherrer - International Conference on Machine …, 2015 - proceedings.mlr.press
We consider LSTD(λ), the least-squares temporal-difference algorithm with eligibility traces
proposed by Boyan (2002). It computes a linear approximation of the value …
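
LSTD(λ) accumulates an eligibility-trace-weighted least-squares system A θ = b over observed transitions and then solves it for the linear value-function weights. A minimal sketch on a toy chain with tabular features (an illustrative reimplementation, not the authors' code):

```python
import numpy as np

def lstd_lambda(transitions, n_features, gamma, lam):
    """One pass of LSTD(lambda) over (phi, r, phi_next) transitions.

    Builds A = sum_t z_t (phi_t - gamma * phi_{t+1})^T and
    b = sum_t z_t r_t, with trace z_t = gamma*lam*z_{t-1} + phi_t,
    then solves A theta = b. phi_next is the zero vector at termination.
    """
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    z = np.zeros(n_features)
    for phi, r, phi_next in transitions:
        z = gamma * lam * z + phi
        A += np.outer(z, phi - gamma * phi_next)
        b += z * r
    return np.linalg.solve(A, b)

# Toy 3-state chain 0 -> 1 -> 2 -> terminal, reward 1 per step,
# one-hot (tabular) features, gamma = 0.9.
e = np.eye(3)
episode = [(e[0], 1.0, e[1]), (e[1], 1.0, e[2]), (e[2], 1.0, np.zeros(3))]
theta = lstd_lambda(episode, 3, gamma=0.9, lam=0.7)
# theta recovers the exact discounted returns [2.71, 1.9, 1.0]
```

On this deterministic episode the Bellman equations are exactly satisfiable, so any λ yields the same solution; the paper's convergence-rate and error-bound analysis concerns the general sampled, approximate setting.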