Jump-start reinforcement learning

I Uchendu, T Xiao, Y Lu, B Zhu, M Yan… - International …, 2023 - proceedings.mlr.press
Reinforcement learning (RL) provides a theoretical framework for continuously improving an
agent's behavior via trial and error. However, efficiently learning policies from scratch can be …

Introduction to multi-armed bandits

A Slivkins - Foundations and Trends® in Machine Learning, 2019 - nowpublishers.com
Multi-armed bandits are a simple but very powerful framework for algorithms that make
decisions over time under uncertainty. An enormous body of work has accumulated over the …

Contextual bandits with large action spaces: Made practical

Y Zhu, DJ Foster, J Langford… - … Conference on Machine …, 2022 - proceedings.mlr.press
A central problem in sequential decision making is to develop algorithms that are practical
and computationally efficient, yet support the use of flexible, general-purpose models …

Model selection for contextual bandits

DJ Foster, A Krishnamurthy… - Advances in Neural …, 2019 - proceedings.neurips.cc
We introduce the problem of model selection for contextual bandits, where a learner must
adapt to the complexity of the optimal policy while balancing exploration and exploitation …

Reliable off-policy learning for dosage combinations

J Schweisthal, D Frauen… - Advances in Neural …, 2024 - proceedings.neurips.cc
Decision-making in personalized medicine such as cancer therapy or critical care must often
make choices for dosage combinations, i.e., multiple continuous treatments. Existing work for …

Contextual bandits with smooth regret: Efficient learning in continuous action spaces

Y Zhu, P Mineiro - International Conference on Machine …, 2022 - proceedings.mlr.press
Designing efficient general-purpose contextual bandit algorithms that work with large—or
even infinite—action spaces would facilitate application to important scenarios such as …

Feedback efficient online fine-tuning of diffusion models

M Uehara, Y Zhao, K Black, E Hajiramezanali… - arXiv preprint arXiv …, 2024 - arxiv.org
Diffusion models excel at modeling complex data distributions, including those of images,
proteins, and small molecules. However, in many cases, our goal is to model parts of the …

Oracle-efficient pessimism: Offline policy optimization in contextual bandits

L Wang, A Krishnamurthy… - … Conference on Artificial …, 2024 - proceedings.mlr.press
We consider offline policy optimization (OPO) in contextual bandits, where one is given a
fixed dataset of logged interactions. While pessimistic regularizers are typically used to …

Adaptive estimator selection for off-policy evaluation

Y Su, P Srinath… - … Conference on Machine …, 2020 - proceedings.mlr.press
We develop a generic data-driven method for estimator selection in off-policy policy
evaluation settings. We establish a strong performance guarantee for the method, showing …

Doubly high-dimensional contextual bandits: An interpretable model for joint assortment-pricing

J Cai, R Chen, MJ Wainwright, L Zhao - arXiv preprint arXiv:2309.08634, 2023 - arxiv.org
Key challenges in running a retail business include how to select products to present to
consumers (the assortment problem), and how to price products (the pricing problem) to …