Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint

W Xiong, H Dong, C Ye, Z Wang, H Zhong… - … on Machine Learning, 2024 - openreview.net
This paper studies the theoretical framework of the alignment process of generative models
with Reinforcement Learning from Human Feedback (RLHF). We consider a standard …

Corruption-robust offline reinforcement learning with general function approximation

C Ye, R Yang, Q Gu, T Zhang - Advances in Neural …, 2024 - proceedings.neurips.cc
We investigate the problem of corruption robustness in offline reinforcement learning (RL)
with general function approximation, where an adversary can corrupt each sample in the …

Gibbs sampling from human feedback: A provable KL-constrained framework for RLHF

W Xiong, H Dong, C Ye, H Zhong, N Jiang… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper studies the theoretical framework of the alignment process of generative models
with Reinforcement Learning from Human Feedback (RLHF). We consider a standard …

When is realizability sufficient for off-policy reinforcement learning?

A Zanette - International Conference on Machine Learning, 2023 - proceedings.mlr.press
Understanding when reinforcement learning algorithms can make successful off-policy
predictions, and when they may fail to do so, remains an open problem. Typically, model …

A theoretical analysis of Nash learning from human feedback under general KL-regularized preference

C Ye, W Xiong, Y Zhang, N Jiang, T Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) learns from the preference signal
provided by a probabilistic preference model, which takes a prompt and two responses as …

Towards robust offline reinforcement learning under diverse data corruption

R Yang, H Zhong, J Xu, A Zhang, C Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Offline reinforcement learning (RL) presents a promising approach for learning reinforced
policies from offline datasets without the need for costly or unsafe interactions with the …

Robust Lipschitz bandits to adversarial corruptions

Y Kang, CJ Hsieh, TCM Lee - Advances in Neural …, 2023 - proceedings.neurips.cc
The Lipschitz bandit is a variant of stochastic bandits that deals with a continuous arm set
defined on a metric space, where the reward function is subject to a Lipschitz constraint. In …

Noise-adaptive Thompson sampling for linear contextual bandits

R Xu, Y Min, T Wang - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Linear contextual bandits represent a fundamental class of models with numerous real-
world applications, and it is critical to develop algorithms that can effectively manage noise …

Pessimistic nonlinear least-squares value iteration for offline reinforcement learning

Q Di, H Zhao, J He, Q Gu - arXiv preprint arXiv:2310.01380, 2023 - arxiv.org
Offline reinforcement learning (RL), where the agent aims to learn the optimal policy based
on the data collected by a behavior policy, has attracted increasing attention in recent years …

A nearly optimal and low-switching algorithm for reinforcement learning with general function approximation

H Zhao, J He, Q Gu - arXiv preprint arXiv:2311.15238, 2023 - arxiv.org
The exploration-exploitation dilemma has been a central challenge in reinforcement
learning (RL) with complex model classes. In this paper, we propose a new algorithm …