While learning in an unknown Markov Decision Process (MDP), an agent must trade off exploration, to discover new information about the MDP, against exploitation of the current …
KS Jun, C Zhang - Advances in Neural Information …, 2020 - proceedings.neurips.cc
We study stochastic structured bandits for minimizing regret. The fact that popular optimistic algorithms do not achieve asymptotic instance-dependent regret optimality …
We study finite-armed stochastic bandits where the rewards of each arm might be correlated with those of other arms. We introduce a novel phased algorithm that exploits the given …
A Hüyük, C Tekin - IEEE/ACM Transactions on Networking, 2020 - ieeexplore.ieee.org
Influence maximization, adaptive routing, and dynamic spectrum allocation all require choosing the right action from a large set of alternatives. Thanks to the advances in …
Y Gur, A Momeni - Manufacturing & Service Operations …, 2022 - pubsonline.informs.org
Problem definition: Sequential experiments that are deployed in a broad range of practices are characterized by an exploration-exploitation trade-off that is well understood when in …
In this work we consider a seller who sells an item via second-price auctions with a reserve price. By controlling the reserve price, the seller can influence the revenue from the auction …
The demand for seamless Internet access under extreme user mobility, such as on high-speed trains and vehicles, has become the norm rather than the exception. However, the …
This work considers training conditional probability distributions called policies, using simulated environments via gradient-based optimization methods. It begins by investigating …
We consider the bandit-based framework for diversity-preserving recommendations introduced by Celis et al. (2019), who approached it mainly by a reduction to the setting of …
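A recurring theme across the abstracts above is the exploration–exploitation trade-off in stochastic bandits. As a point of reference only, here is a minimal sketch of the classic UCB1 index policy on Bernoulli arms; the arm means and horizon are hypothetical, and this is not the algorithm of any of the listed papers:

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms with the given (hypothetical) means.

    Returns per-arm pull counts and total collected reward.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms    # number of pulls per arm
    sums = [0.0] * n_arms    # cumulative reward per arm
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialization: pull each arm once
        else:
            # UCB index = empirical mean + exploration bonus;
            # the bonus shrinks as an arm is pulled more often.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward
    return counts, total_reward

counts, reward = ucb1([0.2, 0.5, 0.8], horizon=2000)
# With well-separated means, the best arm ends up with most of the pulls.
```

The exploration bonus `sqrt(2 ln t / n_a)` is what balances trying under-sampled arms against replaying the empirically best one; several of the papers above study settings (structure, correlation, diversity constraints) where this vanilla index is no longer optimal.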