Bypassing the simulator: Near-optimal adversarial linear contextual bandits

H Liu, CY Wei, J Zimmert - Advances in Neural Information …, 2024 - proceedings.neurips.cc
We consider the adversarial linear contextual bandit problem, where the loss vectors are
selected fully adversarially and the per-round action set (i.e., the context) is drawn from a fixed …

Settling the sample complexity of online reinforcement learning

Z Zhang, Y Chen, JD Lee… - The Thirty Seventh Annual …, 2024 - proceedings.mlr.press
A central issue lying at the heart of online reinforcement learning (RL) is data efficiency.
While a number of recent works achieved asymptotically minimal regret in online RL, the …

Sharp variance-dependent bounds in reinforcement learning: Best of both worlds in stochastic and deterministic environments

R Zhou, Z Zhang, SS Du - International Conference on …, 2023 - proceedings.mlr.press
We study variance-dependent regret bounds for Markov decision processes (MDPs).
Algorithms with variance-dependent regret guarantees can automatically exploit …

Tackling heavy-tailed rewards in reinforcement learning with function approximation: Minimax optimal and instance-dependent regret bounds

J Huang, H Zhong, L Wang… - Advances in Neural …, 2024 - proceedings.neurips.cc
While numerous works have focused on devising efficient algorithms for reinforcement
learning (RL) with uniformly bounded rewards, it remains an open question whether sample …

Noise-adaptive Thompson sampling for linear contextual bandits

R Xu, Y Min, T Wang - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Linear contextual bandits represent a fundamental class of models with numerous real-
world applications, and it is critical to develop algorithms that can effectively manage noise …

Is behavior cloning all you need? Understanding horizon in imitation learning

DJ Foster, A Block, D Misra - arXiv preprint arXiv:2407.15007, 2024 - arxiv.org
Imitation learning (IL) aims to mimic the behavior of an expert in a sequential decision
making task by learning from demonstrations, and has been widely applied to robotics …

Variance-aware regret bounds for stochastic contextual dueling bandits

Q Di, T Jin, Y Wu, H Zhao, F Farnoud, Q Gu - arXiv preprint arXiv …, 2023 - arxiv.org
Dueling bandits is a prominent framework for decision-making involving preferential
feedback, a valuable feature that fits various applications involving human interaction, such …

Sample-efficient Learning of Infinite-horizon Average-reward MDPs with General Function Approximation

J He, H Zhong, Z Yang - arXiv preprint arXiv:2404.12648, 2024 - arxiv.org
We study infinite-horizon average-reward Markov decision processes (AMDPs) in the
context of general function approximation. Specifically, we propose a novel algorithmic …

More benefits of being distributional: Second-order bounds for reinforcement learning

K Wang, O Oertell, A Agarwal, N Kallus… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we prove that Distributional Reinforcement Learning (DistRL), which learns the
return distribution, can obtain second-order bounds in both online and offline RL in general …

Reinforcement learning from human feedback with active queries

K Ji, J He, Q Gu - arXiv preprint arXiv:2402.09401, 2024 - arxiv.org
Aligning large language models (LLMs) with human preferences plays a key role in building
modern generative models and can be achieved by reinforcement learning from human …