How Does Variance Shape the Regret in Contextual Bandits?

Z Jia, J Qian, A Rakhlin, CY Wei - arXiv preprint arXiv:2410.12713, 2024 - arxiv.org
We consider realizable contextual bandits with general function approximation, investigating
how small reward variance can lead to better-than-minimax regret bounds. Unlike in …

Model-based RL as a Minimalist Approach to Horizon-Free and Second-Order Bounds

Z Wang, D Zhou, J Lui, W Sun - arXiv preprint arXiv:2408.08994, 2024 - arxiv.org
Learning a transition model via Maximum Likelihood Estimation (MLE) followed by planning
inside the learned model is perhaps the most standard and simplest model-based …
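The snippet above describes the classic recipe: estimate transitions by MLE, then plan in the estimated model. A minimal tabular sketch of that pipeline (count-based MLE for a categorical transition model, then value iteration as the planner) might look like the following; all names, the toy MDP, and the uniform fallback for unvisited state-action pairs are illustrative assumptions, not details from the paper.

```python
import numpy as np

def mle_transition_model(transitions, n_states, n_actions):
    """Count-based MLE of P(s' | s, a) from (s, a, s') tuples.

    Unvisited (s, a) pairs fall back to a uniform distribution
    (an arbitrary choice for this sketch).
    """
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1.0
    totals = counts.sum(axis=2, keepdims=True)
    return np.where(totals > 0,
                    counts / np.maximum(totals, 1.0),
                    1.0 / n_states)

def plan(P, R, gamma=0.9, iters=200):
    """Value iteration inside the learned model P, reward table R[s, a]."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = R + gamma * (P @ V)   # P: (S, A, S) times V: (S,) -> (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)       # greedy policy w.r.t. the learned model

# Toy 2-state MDP: action a deterministically moves to state a;
# reward 1 only for taking action 1 in state 1.
data = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1)]
P_hat = mle_transition_model(data, n_states=2, n_actions=2)
R = np.array([[0.0, 0.0], [0.0, 1.0]])
policy = plan(P_hat, R)  # both states should prefer action 1
```

In this toy instance the planner recovers the policy that moves to state 1 and stays there, since all future reward flows through that state.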

Warm-up Free Policy Optimization: Improved Regret in Linear Markov Decision Processes

A Cassel, A Rosenberg - arXiv preprint arXiv:2407.03065, 2024 - arxiv.org
Policy Optimization (PO) methods are among the most popular Reinforcement Learning (RL)
algorithms in practice. Recently, Sherman et al. [2023a] proposed a PO-based algorithm with …