Safe RLHF: Safe reinforcement learning from human feedback

J Dai, X Pan, R Sun, J Ji, X Xu, M Liu, Y Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the development of large language models (LLMs), striking a balance between the
performance and safety of AI systems has never been more critical. However, the inherent …

Preference ranking optimization for human alignment

F Song, B Yu, M Li, H Yu, F Huang, Y Li… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Large language models (LLMs) often contain misleading content, emphasizing the need to
align them with human values to ensure secure AI systems. Reinforcement learning from …

Open problems and fundamental limitations of reinforcement learning from human feedback

S Casper, X Davies, C Shi, TK Gilbert… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems
to align with human goals. RLHF has emerged as the central method used to finetune state …

RRHF: Rank responses to align language models with human feedback without tears

Z Yuan, H Yuan, C Tan, W Wang, S Huang… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of large
language models with human preferences, significantly enhancing the quality of interactions …

OmniSafe: An infrastructure for accelerating safe reinforcement learning research

J Ji, J Zhou, B Zhang, J Dai, X Pan, R Sun… - arXiv preprint arXiv …, 2023 - arxiv.org
AI systems empowered by reinforcement learning (RL) algorithms harbor the immense
potential to catalyze societal advancement, yet their deployment is often impeded by …

Secrets of RLHF in large language models part I: PPO

R Zheng, S Dou, S Gao, Y Hua, W Shen… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have formulated a blueprint for the advancement of artificial
general intelligence. Their primary objective is to function as a human-centric (helpful, honest …

RLAIF: Scaling reinforcement learning from human feedback with AI feedback

H Lee, S Phatale, H Mansoor, K Lu, T Mesnard… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large
language models (LLMs) with human preferences. However, gathering high-quality human …

A survey of reinforcement learning from human feedback

T Kaufmann, P Weng, V Bengs… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …
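For context on the formulation these entries refer to, a minimal sketch of the standard two-stage RLHF objective (Bradley-Terry reward modelling followed by KL-regularized policy optimization); the notation below is generic and not drawn from any one of the listed papers: y_w and y_l denote the preferred and dispreferred responses to a prompt x, r_phi is the learned reward model, pi_theta the policy being tuned, pi_ref the supervised fine-tuned reference policy, and beta the KL penalty coefficient.

\mathcal{L}_{\mathrm{RM}}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log \sigma\!\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right]

\max_{\theta}\;\mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot\mid x)}\!\left[r_\phi(x, y)\right] \;-\; \beta\,\mathrm{KL}\!\left(\pi_\theta(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\right)

Several of the papers above modify one of these two stages, e.g. by replacing the engineered reward with human or AI preference data, by ranking responses directly, or by averaging reward-model weights.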

WARM: On the benefits of weight averaged reward models

A Ramé, N Vieillard, L Hussenot, R Dadashi… - arXiv preprint arXiv …, 2024 - arxiv.org
Aligning large language models (LLMs) with human preferences through reinforcement
learning (RLHF) can lead to reward hacking, where LLMs exploit failures in the reward …

Nash learning from human feedback

R Munos, M Valko, D Calandriello, MG Azar… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm
for aligning large language models (LLMs) with human preferences. Typically, RLHF …