Bandits with side observations: Bounded vs. logarithmic regret

B Hao, R Jain, T Lattimore… - … on Machine Learning, 2023 - proceedings.mlr.press

We investigate the extent to which offline demonstration data can improve online learning. It
is natural to expect some improvement, but the question is how, and by how much? We …

Cited by 6 Related articles All 6 versions

[PDF] mlr.press

Conservative exploration in reinforcement learning

E Garcelon, M Ghavamzadeh… - International …, 2020 - proceedings.mlr.press

While learning in an unknown Markov Decision Process (MDP), an agent should trade off
exploration to discover new information about the MDP, and exploitation of the current …

Cited by 31 Related articles All 11 versions

[PDF] neurips.cc

Crush optimism with pessimism: Structured bandits beyond asymptotic optimality

KS Jun, C Zhang - Advances in Neural Information …, 2020 - proceedings.neurips.cc

We study stochastic structured bandits for minimizing regret. The fact that the popular
optimistic algorithms do not achieve the asymptotic instance-dependent regret optimality …

Cited by 18 Related articles All 6 versions

[PDF] mlr.press

A novel confidence-based algorithm for structured bandits

A Tirinzoni, A Lazaric, M Restelli - … Conference on Artificial …, 2020 - proceedings.mlr.press

We study finite-armed stochastic bandits where the rewards of each arm might be correlated
to those of other arms. We introduce a novel phased algorithm that exploits the given …

Cited by 16 Related articles All 6 versions

[PDF] arxiv.org

Thompson sampling for combinatorial network optimization in unknown environments

A Hüyük, C Tekin - IEEE/ACM Transactions on Networking, 2020 - ieeexplore.ieee.org

Influence maximization, adaptive routing, and dynamic spectrum allocation all require
choosing the right action from a large set of alternatives. Thanks to the advances in …

Cited by 16 Related articles All 10 versions

[PDF] arxiv.org

Adaptive sequential experiments with unknown information arrival processes

Y Gur, A Momeni - Manufacturing & Service Operations …, 2022 - pubsonline.informs.org

Problem definition: Sequential experiments that are deployed in a broad range of practices
are characterized by an exploration-exploitation trade-off that is well understood when in …

Cited by 12 Related articles All 10 versions

[PDF] tue.nl

Setting reserve prices in second-price auctions with unobserved bids

J Rhuggenaath, A Akcay, Y Zhang… - INFORMS Journal on …, 2022 - pubsonline.informs.org

In this work we consider a seller who sells an item via second-price auctions with a reserve
price. By controlling the reserve price, the seller can influence the revenue from the auction …

Cited by 2 Related articles All 7 versions

[PDF] arxiv.org

Bandit policies for reliable cellular network handovers in extreme mobility

Y Li, E Datta, J Ding, N Shroff, X Liu - arXiv preprint arXiv:2010.15237, 2020 - arxiv.org

The demand for seamless Internet access under extreme user mobility, such as on high-speed
trains and vehicles, has become a norm rather than an exception. However, the …

Cited by 3 Related articles All 3 versions

[PDF] ubc.ca

Taking advantage of common assumptions in policy optimization and reinforcement learning

JW Lavington - 2024 - open.library.ubc.ca

This work considers training conditional probability distributions called policies, using
simulated environments via gradient-based optimization methods. It begins by investigating …

Diversity-Preserving K-Armed Bandits, Revisited

H Hadiji, S Gerchinovitz, JM Loubes… - arXiv preprint arXiv …, 2020 - arxiv.org

We consider the bandit-based framework for diversity-preserving recommendations
introduced by Celis et al. (2019), who approached it mainly by a reduction to the setting of …

Cited by 1 Related articles All 44 versions
