A survey of reinforcement learning from human feedback

T Kaufmann, P Weng, V Bengs… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …

DPO meets PPO: Reinforced token optimization for RLHF

H Zhong, G Feng, W Xiong, X Cheng, L Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal
Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards--a …
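
For context, the sequence-level objective that PPO-style RLHF typically optimizes, and the DPO loss it is commonly contrasted with, are standard formulations from the RLHF literature (not necessarily the exact token-level objective proposed in this paper):

\[
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\]
\[
\mathcal{L}_{\mathrm{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[ \log \sigma\Big( \beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big) \Big]
\]

Here r(x, y) is a sentence-level reward, \pi_{\mathrm{ref}} is the reference (SFT) policy, y_w and y_l are the preferred and dispreferred responses, and \beta controls the KL regularization strength.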

Building math agents with multi-turn iterative preference learning

W Xiong, C Shi, J Shen, A Rosenberg, Z Qin… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent studies have shown that large language models' (LLMs) mathematical problem-
solving capabilities can be enhanced by integrating external tools, such as code …

Self-play preference optimization for language model alignment

Y Wu, Z Sun, H Yuan, K Ji, Y Yang, Q Gu - arXiv preprint arXiv:2405.00675, 2024 - arxiv.org
Traditional reinforcement learning from human feedback (RLHF) approaches relying on
parametric models like the Bradley-Terry model fall short in capturing the intransitivity and …
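
For background, the Bradley-Terry model referenced here assigns each response a scalar score and models the preference probability as a logistic function of the score difference; because any such scoring induces a total order over responses, it cannot represent intransitive (cyclic) preferences. The standard form of the model is:

\[
\mathbb{P}\big( y_1 \succ y_2 \mid x \big) \;=\; \frac{\exp\big( r(x, y_1) \big)}{\exp\big( r(x, y_1) \big) + \exp\big( r(x, y_2) \big)} \;=\; \sigma\big( r(x, y_1) - r(x, y_2) \big)
\]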

RLHF workflow: From reward modeling to online RLHF

H Dong, W Xiong, B Pang, H Wang, H Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In this technical report, we present the workflow of Online Iterative Reinforcement Learning
from Human Feedback (RLHF), which is widely reported to outperform its offline counterpart …

Arithmetic control of LLMs for diverse user preferences: Directional preference alignment with multi-objective rewards

H Wang, Y Lin, W Xiong, R Yang, S Diao, S Qiu… - arXiv preprint arXiv …, 2024 - arxiv.org
Fine-grained control over large language models (LLMs) remains a significant challenge,
hindering their adaptability to diverse user needs. While Reinforcement Learning from …
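
As an illustration of the multi-objective setting, one common way to express user-directed control is to score responses with a vector of rewards and collapse it along a user-specified preference direction; the exact parameterization used in this paper may differ:

\[
r_{v}(x, y) \;=\; v^{\top} \mathbf{r}(x, y) \;=\; \sum_{i=1}^{k} v_i\, r_i(x, y), \qquad v \in \mathbb{R}^{k},\; \|v\| = 1,\; v_i \ge 0
\]

where \mathbf{r}(x, y) = (r_1, \dots, r_k) collects the individual objectives (e.g., helpfulness, verbosity) and the unit direction v encodes the desired trade-off.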

Decoding-time language model alignment with multiple objectives

R Shi, Y Chen, Y Hu, A Liu, H Hajishirzi… - arXiv preprint arXiv …, 2024 - arxiv.org
Aligning language models (LMs) to human preferences has emerged as a critical pursuit,
enabling these models to better serve diverse user needs. Existing methods primarily focus …
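
One illustrative scheme for combining several objectives at decoding time, given as a sketch rather than this paper's exact rule, is to mix per-objective policies token by token with user weights w_i:

\[
\pi_{w}\big( y_t \mid x, y_{<t} \big) \;\propto\; \prod_{i=1}^{k} \pi_{i}\big( y_t \mid x, y_{<t} \big)^{\,w_i}, \qquad \sum_{i=1}^{k} w_i = 1,\; w_i \ge 0
\]

so that changing w at inference time steers generation toward different objectives without retraining.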

REBEL: Reinforcement learning via regressing relative rewards

Z Gao, JD Chang, W Zhan, O Oertell, G Swamy… - arXiv preprint arXiv …, 2024 - arxiv.org
While originally developed for continuous control problems, Proximal Policy Optimization
(PPO) has emerged as the workhorse of a variety of reinforcement learning (RL) …
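
The title points to replacing PPO's clipped policy-gradient update with a least-squares regression that fits the change in log-probabilities to the difference in rewards between two responses; a sketch of such an update, not necessarily the paper's exact objective, is:

\[
\theta_{t+1} \;=\; \arg\min_{\theta}\; \mathbb{E}_{(x,\, y,\, y')}\Big[ \Big( \tfrac{1}{\eta}\Big( \log \tfrac{\pi_\theta(y \mid x)}{\pi_{t}(y \mid x)} - \log \tfrac{\pi_\theta(y' \mid x)}{\pi_{t}(y' \mid x)} \Big) - \big( r(x, y) - r(x, y') \big) \Big)^{2} \Big]
\]

where \pi_t is the current policy and \eta plays the role of a step-size / regularization parameter.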

From lists to emojis: How format bias affects model alignment

X Zhang, W Xiong, L Chen, T Zhou, H Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we study format biases in reinforcement learning from human feedback
(RLHF). We observe that many widely-used preference models, including human …

Exploratory preference optimization: Harnessing implicit Q*-approximation for sample-efficient RLHF

T Xie, DJ Foster, A Krishnamurthy, C Rosset… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) has emerged as a central tool for
language model alignment. We consider online exploration in RLHF, which exploits …
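
For background on the "implicit" quantity alluded to in the title: in KL-regularized RLHF the optimal policy and the reward are linked in closed form, so the policy's log-ratio against the reference acts as an implicit value estimate. The standard identity (background, not this paper's specific algorithm) is:

\[
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big)
\quad\Longleftrightarrow\quad
r(x, y) \;=\; \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;+\; \beta \log Z(x)
\]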