Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning (RL) that learns from human feedback instead of relying on an engineered reward function …
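As a concrete illustration of the idea in the snippet above (learning the reward from human feedback rather than hand-engineering it), the following is a minimal, self-contained Python sketch, not taken from any of the papers listed here, of fitting a scalar reward model to pairwise human preferences with a Bradley-Terry style logistic loss. The function names, toy features, and hyperparameters are all illustrative assumptions.

```python
# Minimal sketch of RLHF-style reward modelling from pairwise preferences.
# Instead of an engineered reward, fit reward weights so that human-preferred
# responses score higher than rejected ones (Bradley-Terry / logistic loss).
# Names and data are toy illustrations, not any specific paper's method.

import math
import random

def toy_reward(features, weights):
    """Linear stand-in for a learned reward model r_theta(x, y)."""
    return sum(f * w for f, w in zip(features, weights))

def preference_loss(weights, chosen, rejected):
    """Negative log-likelihood: -log sigma(r(chosen) - r(rejected))."""
    margin = toy_reward(chosen, weights) - toy_reward(rejected, weights)
    return math.log(1.0 + math.exp(-margin))

def train_reward_model(pairs, dim, lr=0.1, steps=500):
    """Fit reward weights to human preference pairs by simple gradient descent."""
    weights = [0.0] * dim
    for _ in range(steps):
        chosen, rejected = random.choice(pairs)
        margin = toy_reward(chosen, weights) - toy_reward(rejected, weights)
        # d/dw of -log sigma(margin) = -(1 - sigma(margin)) * (chosen - rejected)
        grad_scale = -(1.0 - 1.0 / (1.0 + math.exp(-margin)))
        for i in range(dim):
            weights[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return weights

if __name__ == "__main__":
    # Toy "responses" described by 2 features; humans prefer the first of each pair.
    pairs = [([1.0, 0.2], [0.1, 0.9]), ([0.8, 0.1], [0.3, 0.7])]
    w = train_reward_model(pairs, dim=2)
    print("learned reward weights:", w)
    print("loss on first pair:", preference_loss(w, *pairs[0]))
```

In a full RLHF pipeline the learned reward would then be used to optimize the policy, for example with PPO; the toy example above covers only the reward-modelling step.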
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences. However, gathering high-quality human …
J Dai, X Pan, R Sun, J Ji, X Xu, M Liu, Y Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the development of large language models (LLMs), striking a balance between the performance and safety of AI systems has never been more critical. However, the inherent …
Reinforcement learning from human feedback (RLHF) has become a pivotal technique in aligning large language models (LLMs) with human preferences. In RLHF practice …
AI alignment in the form of Reinforcement Learning from Human Feedback (RLHF) is increasingly treated as a crucial ingredient for high-performance large language …
This study is an empirical investigation into the semantic vulnerabilities of four popular pre-trained commercial Large Language Models (LLMs) to ideological manipulation. Using …
Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of large language models with human preferences, significantly enhancing the quality of interactions …
Large language models (LLMs) have laid out a blueprint for the advancement of artificial general intelligence. Their primary objective is to function as a human-centric (helpful, honest …
B Wang, R Zheng, L Chen, Y Liu, S Dou… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) has become a crucial technology for aligning language models with human values and intentions, enabling models to …