J Wang, Y Zhou, X Zhang, M Bao, P Yan - arXiv preprint arXiv:2409.11212, 2024 - arxiv.org
Iterative preference optimization has recently become one of the de facto training paradigms
for large language models (LLMs), but performance is still underwhelming due to too …