Safe RLHF: Safe reinforcement learning from human feedback

J Dai, X Pan, R Sun, J Ji, X Xu, M Liu, Y Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the development of large language models (LLMs), striking a balance between the
performance and safety of AI systems has never been more critical. However, the inherent …

Preference ranking optimization for human alignment

F Song, B Yu, M Li, H Yu, F Huang, Y Li… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Large language models (LLMs) often contain misleading content, emphasizing the need to
align them with human values to ensure secure AI systems. Reinforcement learning from …

Open problems and fundamental limitations of reinforcement learning from human feedback

S Casper, X Davies, C Shi, TK Gilbert… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems
to align with human goals. RLHF has emerged as the central method used to finetune state …

RRHF: Rank responses to align language models with human feedback without tears

Z Yuan, H Yuan, C Tan, W Wang, S Huang… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of large
language models with human preferences, significantly enhancing the quality of interactions …

OmniSafe: An infrastructure for accelerating safe reinforcement learning research

J Ji, J Zhou, B Zhang, J Dai, X Pan, R Sun… - arXiv preprint arXiv …, 2023 - arxiv.org
AI systems empowered by reinforcement learning (RL) algorithms harbor the immense
potential to catalyze societal advancement, yet their deployment is often impeded by …

Secrets of RLHF in large language models part I: PPO

R Zheng, S Dou, S Gao, Y Hua, W Shen… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have formulated a blueprint for the advancement of artificial
general intelligence. Their primary objective is to function as a human-centric (helpful, honest …

RLAIF: Scaling reinforcement learning from human feedback with AI feedback

H Lee, S Phatale, H Mansoor, K Lu, T Mesnard… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large
language models (LLMs) with human preferences. However, gathering high-quality human …

A survey of reinforcement learning from human feedback

T Kaufmann, P Weng, V Bengs… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …
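For context on the formulation these entries refer to, a minimal sketch of the standard two-stage RLHF objective (Bradley-Terry reward modelling followed by KL-regularized policy optimization); the notation below is generic and not drawn from any one of the listed papers: y_w and y_l denote the preferred and dispreferred responses to a prompt x, r_phi is the learned reward model, pi_theta the policy being tuned, pi_ref the supervised fine-tuned reference policy, and beta the KL penalty coefficient.

\mathcal{L}_{\mathrm{RM}}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log \sigma\!\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right]

\max_{\theta}\;\mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot\mid x)}\!\left[r_\phi(x, y)\right] \;-\; \beta\,\mathrm{KL}\!\left(\pi_\theta(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\right)

Several of the papers above modify one of these two stages, e.g. by replacing the engineered reward with human or AI preference data, by ranking responses directly, or by averaging reward-model weights.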

WARM: On the benefits of weight averaged reward models

A Ramé, N Vieillard, L Hussenot, R Dadashi… - arXiv preprint arXiv …, 2024 - arxiv.org
Aligning large language models (LLMs) with human preferences through reinforcement
learning (RLHF) can lead to reward hacking, where LLMs exploit failures in the reward …

Nash learning from human feedback

R Munos, M Valko, D Calandriello, MG Azar… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm
for aligning large language models (LLMs) with human preferences. Typically, RLHF …