A survey of reinforcement learning from human feedback

T Kaufmann, P Weng, V Bengs… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …

DPO meets PPO: Reinforced token optimization for RLHF

H Zhong, G Feng, W Xiong, X Cheng, L Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal
Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards--a …
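
For context, the sequence-level objective that PPO-style RLHF typically optimizes, and the DPO loss it is commonly contrasted with, are standard formulations from the RLHF literature (not necessarily the exact token-level objective proposed in this paper):

\[
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\]
\[
\mathcal{L}_{\mathrm{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[ \log \sigma\Big( \beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big) \Big]
\]

Here r(x, y) is a sentence-level reward, \pi_{\mathrm{ref}} is the reference (SFT) policy, y_w and y_l are the preferred and dispreferred responses, and \beta controls the KL regularization strength.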

Building math agents with multi-turn iterative preference learning

W Xiong, C Shi, J Shen, A Rosenberg, Z Qin… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent studies have shown that large language models' (LLMs) mathematical problem-
solving capabilities can be enhanced by integrating external tools, such as code …

Self-play preference optimization for language model alignment

Y Wu, Z Sun, H Yuan, K Ji, Y Yang, Q Gu - arXiv preprint arXiv:2405.00675, 2024 - arxiv.org
Traditional reinforcement learning from human feedback (RLHF) approaches relying on
parametric models like the Bradley-Terry model fall short in capturing the intransitivity and …
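
For background, the Bradley-Terry model referenced here assigns each response a scalar score and models the preference probability as a logistic function of the score difference; because any such scoring induces a total order over responses, it cannot represent intransitive (cyclic) preferences. The standard form of the model is:

\[
\mathbb{P}\big( y_1 \succ y_2 \mid x \big) \;=\; \frac{\exp\big( r(x, y_1) \big)}{\exp\big( r(x, y_1) \big) + \exp\big( r(x, y_2) \big)} \;=\; \sigma\big( r(x, y_1) - r(x, y_2) \big)
\]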

RLHF workflow: From reward modeling to online RLHF

H Dong, W Xiong, B Pang, H Wang, H Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In this technical report, we present the workflow of Online Iterative Reinforcement Learning
from Human Feedback (RLHF), which is widely reported to outperform its offline counterpart …

Arithmetic control of LLMs for diverse user preferences: Directional preference alignment with multi-objective rewards

H Wang, Y Lin, W Xiong, R Yang, S Diao, S Qiu… - arXiv preprint arXiv …, 2024 - arxiv.org
Fine-grained control over large language models (LLMs) remains a significant challenge,
hindering their adaptability to diverse user needs. While Reinforcement Learning from …
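
As an illustration of the multi-objective setting, one common way to express user-directed control is to score responses with a vector of rewards and collapse it along a user-specified preference direction; the exact parameterization used in this paper may differ:

\[
r_{v}(x, y) \;=\; v^{\top} \mathbf{r}(x, y) \;=\; \sum_{i=1}^{k} v_i\, r_i(x, y), \qquad v \in \mathbb{R}^{k},\; \|v\| = 1,\; v_i \ge 0
\]

where \mathbf{r}(x, y) = (r_1, \dots, r_k) collects the individual objectives (e.g., helpfulness, verbosity) and the unit direction v encodes the desired trade-off.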

Decoding-time language model alignment with multiple objectives

R Shi, Y Chen, Y Hu, A Liu, H Hajishirzi… - arXiv preprint arXiv …, 2024 - arxiv.org
Aligning language models (LMs) to human preferences has emerged as a critical pursuit,
enabling these models to better serve diverse user needs. Existing methods primarily focus …
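
One illustrative scheme for combining several objectives at decoding time, given as a sketch rather than this paper's exact rule, is to mix per-objective policies token by token with user weights w_i:

\[
\pi_{w}\big( y_t \mid x, y_{<t} \big) \;\propto\; \prod_{i=1}^{k} \pi_{i}\big( y_t \mid x, y_{<t} \big)^{\,w_i}, \qquad \sum_{i=1}^{k} w_i = 1,\; w_i \ge 0
\]

so that changing w at inference time steers generation toward different objectives without retraining.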

REBEL: Reinforcement learning via regressing relative rewards

Z Gao, JD Chang, W Zhan, O Oertell, G Swamy… - arXiv preprint arXiv …, 2024 - arxiv.org
While originally developed for continuous control problems, Proximal Policy Optimization
(PPO) has emerged as the workhorse of a variety of reinforcement learning (RL) …
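
The title points to replacing PPO's clipped policy-gradient update with a least-squares regression that fits the change in log-probabilities to the difference in rewards between two responses; a sketch of such an update, not necessarily the paper's exact objective, is:

\[
\theta_{t+1} \;=\; \arg\min_{\theta}\; \mathbb{E}_{(x,\, y,\, y')}\Big[ \Big( \tfrac{1}{\eta}\Big( \log \tfrac{\pi_\theta(y \mid x)}{\pi_{t}(y \mid x)} - \log \tfrac{\pi_\theta(y' \mid x)}{\pi_{t}(y' \mid x)} \Big) - \big( r(x, y) - r(x, y') \big) \Big)^{2} \Big]
\]

where \pi_t is the current policy and \eta plays the role of a step-size / regularization parameter.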

From lists to emojis: How format bias affects model alignment

X Zhang, W Xiong, L Chen, T Zhou, H Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we study format biases in reinforcement learning from human feedback
(RLHF). We observe that many widely-used preference models, including human …

Exploratory preference optimization: Harnessing implicit Q*-approximation for sample-efficient RLHF

T Xie, DJ Foster, A Krishnamurthy, C Rosset… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) has emerged as a central tool for
language model alignment. We consider online exploration in RLHF, which exploits …
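
For background on the "implicit" quantity alluded to in the title: in KL-regularized RLHF the optimal policy and the reward are linked in closed form, so the policy's log-ratio against the reference acts as an implicit value estimate. The standard identity (background, not this paper's specific algorithm) is:

\[
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big)
\quad\Longleftrightarrow\quad
r(x, y) \;=\; \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;+\; \beta \log Z(x)
\]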