This paper develops a theoretical framework for the alignment of generative models via Reinforcement Learning from Human Feedback (RLHF). We consider a standard …
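For reference, the standard setup this line of work analyzes is the KL-regularized objective (this is the commonly used formulation, not necessarily this paper's exact variant; $\beta$ is the KL coefficient and $\pi_{\mathrm{ref}}$ the reference policy):

$$\max_{\pi}\;\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi(\cdot\mid x)}\big[r(x,y)\big]\;-\;\beta\,\mathbb{D}_{\mathrm{KL}}\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)$$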
This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three categories …
This paper studies post-training large language models (LLMs) using preference feedback from a powerful oracle to help a model iteratively improve upon itself. The typical approach …
Y Meng, M Xia, D Chen - arXiv preprint arXiv:2405.14734, 2024 - arxiv.org
Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that reparameterizes reward functions in reinforcement learning from human …
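In code, the DPO reparameterization reduces to a logistic loss on the gap between the chosen and rejected policy-vs-reference log-ratios; a minimal PyTorch sketch (variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss: -log sigmoid(beta * margin), where the margin is the
    chosen minus rejected policy-vs-reference log-probability ratio."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Logistic loss on the margin, averaged over the batch.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```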
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning. Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve state-of-the-art …
Large Language Models (LLMs) often memorize sensitive, private, or copyrighted data during pre-training. LLM unlearning aims to eliminate the influence of undesirable data from …
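The snippet does not name a specific method, but a common baseline in the unlearning literature is gradient ascent on the forget set; a minimal sketch, assuming a Hugging-Face-style causal LM whose forward pass returns a `.loss` when labels are included in the batch:

```python
def gradient_ascent_unlearning_step(model, forget_batch, optimizer) -> float:
    """One gradient-ascent step on the forget set: negate the usual
    LM loss so the optimizer *increases* loss on undesirable data.
    Assumes forget_batch includes labels so model(...) returns a loss."""
    outputs = model(**forget_batch)
    loss = -outputs.loss  # ascend instead of descend
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return -loss.item()   # report the (positive) LM loss on forget data
```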
Traditional reinforcement learning from human feedback (RLHF) approaches relying on parametric models like the Bradley-Terry model fall short in capturing the intransitivity and …
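For context, the Bradley-Terry model assigns each response a scalar reward and sets the preference probability to a sigmoid of the score difference, which is precisely what makes intransitive (cyclic) preferences unrepresentable; a minimal sketch:

```python
import math

def bradley_terry_prob(r_a: float, r_b: float) -> float:
    """P(a preferred over b) under Bradley-Terry: a sigmoid of the
    difference between the two scalar reward scores."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

# Because every response gets a single scalar score, preferences are
# forced to be transitive: r_a > r_b and r_b > r_c imply r_a > r_c,
# so P(a > c) > 0.5. A cycle such as a > b > c > a has no valid scores.
```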
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart …
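A minimal sketch of what such an online iterative loop can look like (all helper functions here are hypothetical, not this report's API): each round samples fresh response pairs from the current policy, labels them with a preference model, and applies a preference-optimization update before the next round.

```python
def online_iterative_rlhf(policy, ref_policy, preference_model, prompts, n_iters=3):
    """Sketch of an online iterative RLHF loop: each round collects fresh
    preference pairs from the *current* policy and updates it on them.
    sample_pair and train_dpo_epoch are hypothetical helpers."""
    for _ in range(n_iters):
        pairs = []
        for x in prompts:
            y1, y2 = sample_pair(policy, x)          # two candidate responses
            if preference_model.prefers(x, y1, y2):  # preference label
                pairs.append((x, y1, y2))            # (prompt, chosen, rejected)
            else:
                pairs.append((x, y2, y1))
        policy = train_dpo_epoch(policy, ref_policy, pairs)  # update on fresh data
    return policy
```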
When prompting language models to complete a task, users often leave important aspects unsaid. While asking questions could resolve this ambiguity (GATE; Li et al., 2023) …