Z Zhang, Y Chen, JD Lee… - The Thirty Seventh Annual …, 2024 - proceedings.mlr.press
A central issue lying at the heart of online reinforcement learning (RL) is data efficiency. While a number of recent works achieved asymptotically minimal regret in online RL, the …
R Zhou, Z Zihan, SS Du - International Conference on …, 2023 - proceedings.mlr.press
We study variance-dependent regret bounds for Markov decision processes (MDPs). Algorithms with variance-dependent regret guarantees can automatically exploit …
While numerous works have focused on devising efficient algorithms for reinforcement learning (RL) with uniformly bounded rewards, it remains an open question whether sample …
R Xu, Y Min, T Wang - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Linear contextual bandits represent a fundamental class of models with numerous real-world applications, and it is critical to develop algorithms that can effectively manage noise …
Imitation learning (IL) aims to mimic the behavior of an expert in a sequential decision making task by learning from demonstrations, and has been widely applied to robotics …
Dueling bandits is a prominent framework for decision-making involving preferential feedback, a valuable feature that fits various applications involving human interaction, such …
J He, H Zhong, Z Yang - arXiv preprint arXiv:2404.12648, 2024 - arxiv.org
We study infinite-horizon average-reward Markov decision processes (AMDPs) in the context of general function approximation. Specifically, we propose a novel algorithmic …
In this paper, we prove that Distributional Reinforcement Learning (DistRL), which learns the return distribution, can obtain second-order bounds in both online and offline RL in general …
K Ji, J He, Q Gu - arXiv preprint arXiv:2402.09401, 2024 - arxiv.org
Aligning large language models (LLMs) with human preference plays a key role in building modern generative models and can be achieved by reinforcement learning from human …