Robust preference optimization through reward model distillation

A Fisch, J Eisenstein, V Zayats, A Agarwal, et al. - arXiv preprint, 2024 - arxiv.org
Language model (LM) post-training (or alignment) involves maximizing a reward function
that is derived from preference annotations. Direct Preference Optimization (DPO) is a …
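[For context: the following is the standard DPO objective as commonly stated in the literature, included here only as background; it is not taken from this paper's abstract and does not describe its distillation-based method. Given preference pairs (x, y_w, y_l), a fixed reference policy \pi_{\mathrm{ref}}, and a temperature \beta, DPO trains the policy \pi_\theta directly on preferences:

\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l)}\Big[ \log \sigma\Big( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big) \Big],

where \sigma is the logistic function.]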

Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs

X Zhang, C Du, T Pang, Q Liu, W Gao, M Lin - arXiv preprint, 2024 - arxiv.org
The recent development of chain-of-thought (CoT) decoding has enabled large language
models (LLMs) to generate explicit logical reasoning paths for complex problem-solving …

Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback

C Chen, Y Hu, W Wu, H Wang, ES Chng, et al. - arXiv preprint, 2024 - arxiv.org
In recent years, text-to-speech (TTS) technology has witnessed impressive advancements,
particularly with large-scale training datasets, showcasing human-level speech quality and …

Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence

J Lu, J Li, S An, M Zhao, Y He, D Yin, X Sun - arXiv preprint, 2024 - arxiv.org
Direct Preference Optimization (DPO) has emerged as a prominent algorithm for the direct
and robust alignment of Large Language Models (LLMs) with human preferences, offering a …