W Zhang, J He, Z Fan, Q Gu - International Conference on …, 2023 - proceedings.mlr.press
We study linear contextual bandits in the misspecified setting, where the expected reward function can be approximated by a linear function class up to a bounded misspecification …
learning with function approximation is one of the most empirically successful while theoretically mysterious reinforcement learning (RL) algorithms and was identified in [RS …
W Kim, G Iyengar, A Zeevi - International Conference on …, 2024 - proceedings.mlr.press
We propose a new regret minimization algorithm for the episodic sparse linear Markov decision process (SMDP), where the state-transition distribution is a linear function of observed …
Y Chen, J He, Q Gu - International Conference on Machine …, 2022 - proceedings.mlr.press
We study reinforcement learning for infinite-horizon discounted linear kernel MDPs, where the transition probability function is linear in a predefined feature mapping. Existing …
A Ghosh, X Zhou, N Shroff - International Conference on …, 2024 - proceedings.mlr.press
We study constrained Markov decision processes (CMDPs), in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected total value …
Y Wu, J He, Q Gu - Uncertainty in Artificial Intelligence, 2023 - proceedings.mlr.press
Recently, there has been remarkable progress in reinforcement learning (RL) with general function approximation. However, all these works only provide regret or sample complexity …
J Liu, Y Li, L Yang - arXiv preprint arXiv:2402.12711, 2024 - arxiv.org
Existing performance measures for bandit algorithms such as regret, PAC bounds, or uniform-PAC (Dann et al., 2017), typically evaluate the cumulative performance, while …
We study constant regret guarantees in reinforcement learning (RL). Our objective is to design an algorithm that incurs only finite regret over infinite episodes with high probability …
Z Wang, J Xie, Y Chen, J Lui, D Zhou - arXiv preprint arXiv:2403.10732, 2024 - arxiv.org
We investigate the non-stationary stochastic linear bandit problem where the reward distribution evolves each round. Existing algorithms characterize the non-stationarity by the …