Bypassing the simulator: Near-optimal adversarial linear contextual bandits

H Liu, CY Wei, J Zimmert - Advances in Neural Information …, 2024 - proceedings.neurips.cc
We consider the adversarial linear contextual bandit problem, where the loss vectors are
selected fully adversarially and the per-round action set (i.e., the context) is drawn from a fixed …

Settling the sample complexity of online reinforcement learning

Z Zhang, Y Chen, JD Lee… - The Thirty Seventh Annual …, 2024 - proceedings.mlr.press
A central issue lying at the heart of online reinforcement learning (RL) is data efficiency.
While a number of recent works achieved asymptotically minimal regret in online RL, the …

Sharp variance-dependent bounds in reinforcement learning: Best of both worlds in stochastic and deterministic environments

R Zhou, Z Zhang, SS Du - International Conference on …, 2023 - proceedings.mlr.press
We study variance-dependent regret bounds for Markov decision processes (MDPs).
Algorithms with variance-dependent regret guarantees can automatically exploit …

Tackling heavy-tailed rewards in reinforcement learning with function approximation: Minimax optimal and instance-dependent regret bounds

J Huang, H Zhong, L Wang… - Advances in Neural …, 2024 - proceedings.neurips.cc
While numerous works have focused on devising efficient algorithms for reinforcement
learning (RL) with uniformly bounded rewards, it remains an open question whether sample …

Noise-adaptive Thompson sampling for linear contextual bandits

R Xu, Y Min, T Wang - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Linear contextual bandits represent a fundamental class of models with numerous real-
world applications, and it is critical to develop algorithms that can effectively manage noise …

Is behavior cloning all you need? Understanding horizon in imitation learning

DJ Foster, A Block, D Misra - arXiv preprint arXiv:2407.15007, 2024 - arxiv.org
Imitation learning (IL) aims to mimic the behavior of an expert in a sequential decision
making task by learning from demonstrations, and has been widely applied to robotics …

Variance-aware regret bounds for stochastic contextual dueling bandits

Q Di, T Jin, Y Wu, H Zhao, F Farnoud, Q Gu - arXiv preprint arXiv …, 2023 - arxiv.org
Dueling bandits is a prominent framework for decision-making involving preferential
feedback, a valuable feature that fits various applications involving human interaction, such …

Sample-efficient Learning of Infinite-horizon Average-reward MDPs with General Function Approximation

J He, H Zhong, Z Yang - arXiv preprint arXiv:2404.12648, 2024 - arxiv.org
We study infinite-horizon average-reward Markov decision processes (AMDPs) in the
context of general function approximation. Specifically, we propose a novel algorithmic …

More benefits of being distributional: Second-order bounds for reinforcement learning

K Wang, O Oertell, A Agarwal, N Kallus… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we prove that Distributional Reinforcement Learning (DistRL), which learns the
return distribution, can obtain second-order bounds in both online and offline RL in general …

Reinforcement learning from human feedback with active queries

K Ji, J He, Q Gu - arXiv preprint arXiv:2402.09401, 2024 - arxiv.org
Aligning large language models (LLMs) with human preferences plays a key role in building
modern generative models and can be achieved by reinforcement learning from human …