Leveraging demonstrations to improve online learning: Quality matters

B Hao, R Jain, T Lattimore… - … on Machine Learning, 2023 - proceedings.mlr.press
We investigate the extent to which offline demonstration data can improve online learning. It
is natural to expect some improvement, but the question is how, and by how much? We …

Conservative exploration in reinforcement learning

E Garcelon, M Ghavamzadeh… - International …, 2020 - proceedings.mlr.press
While learning in an unknown Markov Decision Process (MDP), an agent should trade off
exploration to discover new information about the MDP, and exploitation of the current …

Crush optimism with pessimism: Structured bandits beyond asymptotic optimality

KS Jun, C Zhang - Advances in Neural Information …, 2020 - proceedings.neurips.cc
We study stochastic structured bandits for minimizing regret. The fact that the popular
optimistic algorithms do not achieve the asymptotic instance-dependent regret optimality …

A novel confidence-based algorithm for structured bandits

A Tirinzoni, A Lazaric, M Restelli - … Conference on Artificial …, 2020 - proceedings.mlr.press
We study finite-armed stochastic bandits where the rewards of each arm might be correlated
to those of other arms. We introduce a novel phased algorithm that exploits the given …

Thompson sampling for combinatorial network optimization in unknown environments

A Hüyük, C Tekin - IEEE/ACM Transactions on Networking, 2020 - ieeexplore.ieee.org
Influence maximization, adaptive routing, and dynamic spectrum allocation all require
choosing the right action from a large set of alternatives. Thanks to the advances in …

Adaptive sequential experiments with unknown information arrival processes

Y Gur, A Momeni - Manufacturing & Service Operations …, 2022 - pubsonline.informs.org
Problem definition: Sequential experiments that are deployed in a broad range of practices
are characterized by an exploration-exploitation trade-off that is well understood when in …

Setting reserve prices in second-price auctions with unobserved bids

J Rhuggenaath, A Akcay, Y Zhang… - INFORMS Journal on …, 2022 - pubsonline.informs.org
In this work we consider a seller who sells an item via second-price auctions with a reserve
price. By controlling the reserve price, the seller can influence the revenue from the auction …

Bandit policies for reliable cellular network handovers in extreme mobility

Y Li, E Datta, J Ding, N Shroff, X Liu - arXiv preprint arXiv:2010.15237, 2020 - arxiv.org
The demand for seamless Internet access under extreme user mobility, such as on high-speed
trains and vehicles, has become a norm rather than an exception. However, the …

Taking advantage of common assumptions in policy optimization and reinforcement learning

JW Lavington - 2024 - open.library.ubc.ca
This work considers training conditional probability distributions called policies, using
simulated environments via gradient-based optimization methods. It begins by investigating …

Diversity-Preserving K-Armed Bandits, Revisited

H Hadiji, S Gerchinovitz, JM Loubes… - arXiv preprint arXiv …, 2020 - arxiv.org
We consider the bandit-based framework for diversity-preserving recommendations
introduced by Celis et al. (2019), who approached it mainly by a reduction to the setting of …