Contextual bandits with continuous actions: Smoothing, zooming, and adapting

I Uchendu, T Xiao, Y Lu, B Zhu, M Yan… - International …, 2023 - proceedings.mlr.press

Reinforcement learning (RL) provides a theoretical framework for continuously improving an
agent's behavior via trial and error. However, efficiently learning policies from scratch can be …

Cited by 124 Related articles All 10 versions

[PDF] nowpublishers.com

Introduction to multi-armed bandits

A Slivkins - Foundations and Trends® in Machine Learning, 2019 - nowpublishers.com

Multi-armed bandits is a simple but very powerful framework for algorithms that make
decisions over time under uncertainty. An enormous body of work has accumulated over the …

Cited by 1235 Related articles All 7 versions

[PDF] mlr.press

Contextual bandits with large action spaces: Made practical

Y Zhu, DJ Foster, J Langford… - … Conference on Machine …, 2022 - proceedings.mlr.press

A central problem in sequential decision making is to develop algorithms that are practical
and computationally efficient, yet support the use of flexible, general-purpose models …

Cited by 37 Related articles All 3 versions

[PDF] neurips.cc

Model selection for contextual bandits

DJ Foster, A Krishnamurthy… - Advances in Neural …, 2019 - proceedings.neurips.cc

We introduce the problem of model selection for contextual bandits, where a learner must
adapt to the complexity of the optimal policy while balancing exploration and exploitation …

Cited by 104 Related articles All 10 versions

[PDF] neurips.cc

Reliable off-policy learning for dosage combinations

J Schweisthal, D Frauen… - Advances in Neural …, 2024 - proceedings.neurips.cc

Decision-making in personalized medicine such as cancer therapy or critical care must often
make choices for dosage combinations, i.e., multiple continuous treatments. Existing work for …

Cited by 8 Related articles All 7 versions

[PDF] mlr.press

Contextual bandits with smooth regret: Efficient learning in continuous action spaces

Y Zhu, P Mineiro - International Conference on Machine …, 2022 - proceedings.mlr.press

Designing efficient general-purpose contextual bandit algorithms that work with large—or
even infinite—action spaces would facilitate application to important scenarios such as …

Cited by 16 Related articles All 3 versions

[PDF] arxiv.org

Feedback efficient online fine-tuning of diffusion models

M Uehara, Y Zhao, K Black, E Hajiramezanali… - arXiv preprint arXiv …, 2024 - arxiv.org

Diffusion models excel at modeling complex data distributions, including those of images,
proteins, and small molecules. However, in many cases, our goal is to model parts of the …

Cited by 15 Related articles All 3 versions

[PDF] mlr.press

Oracle-efficient pessimism: Offline policy optimization in contextual bandits

L Wang, A Krishnamurthy… - … Conference on Artificial …, 2024 - proceedings.mlr.press

We consider offline policy optimization (OPO) in contextual bandits, where one is given a
fixed dataset of logged interactions. While pessimistic regularizers are typically used to …

Cited by 8 Related articles All 3 versions

[PDF] mlr.press

Adaptive estimator selection for off-policy evaluation

Y Su, P Srinath… - … Conference on Machine …, 2020 - proceedings.mlr.press

We develop a generic data-driven method for estimator selection in off-policy policy
evaluation settings. We establish a strong performance guarantee for the method, showing …

Cited by 44 Related articles All 5 versions

[PDF] arxiv.org

Doubly high-dimensional contextual bandits: An interpretable model for joint assortment-pricing

J Cai, R Chen, MJ Wainwright, L Zhao - arXiv preprint arXiv:2309.08634, 2023 - arxiv.org

Key challenges in running a retail business include how to select products to present to
consumers (the assortment problem), and how to price products (the pricing problem) to …

Cited by 5 Related articles All 4 versions
