Robust preference optimization through reward model distillation

A Fisch, J Eisenstein, V Zayats, A Agarwal, et al. - arXiv preprint, 2024 - arxiv.org
Language model (LM) post-training (or alignment) involves maximizing a reward function
that is derived from preference annotations. Direct Preference Optimization (DPO) is a …
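[For context: the following is the standard DPO objective as commonly stated in the literature, included here only as background; it is not taken from this paper's abstract and does not describe its distillation-based method. Given preference pairs (x, y_w, y_l), a fixed reference policy \pi_{\mathrm{ref}}, and a temperature \beta, DPO trains the policy \pi_\theta directly on preferences:

\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l)}\Big[ \log \sigma\Big( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big) \Big],

where \sigma is the logistic function.]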

Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs

X Zhang, C Du, T Pang, Q Liu, W Gao, M Lin - arXiv preprint, 2024 - arxiv.org
The recent development of chain-of-thought (CoT) decoding has enabled large language
models (LLMs) to generate explicit logical reasoning paths for complex problem-solving …

Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback

C Chen, Y Hu, W Wu, H Wang, ES Chng, et al. - arXiv preprint, 2024 - arxiv.org
In recent years, text-to-speech (TTS) technology has witnessed impressive advancements,
particularly with large-scale training datasets, showcasing human-level speech quality and …

Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence

J Lu, J Li, S An, M Zhao, Y He, D Yin, X Sun - arXiv preprint, 2024 - arxiv.org
Direct Preference Optimization (DPO) has emerged as a prominent algorithm for the direct
and robust alignment of Large Language Models (LLMs) with human preferences, offering a …