SimPO: Simple preference optimization with a reference-free reward

Y Meng, M Xia, D Chen - arXiv preprint arXiv:2405.14734, 2024 - arxiv.org
Direct Preference Optimization (DPO) is a widely used offline preference optimization
algorithm that reparameterizes reward functions in reinforcement learning from human …
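For context, DPO trains on a pairwise logistic loss over an implicit reward defined against a reference policy, while SimPO replaces that reward with a length-normalized, reference-free log-probability plus a target margin γ. A sketch of the two objectives in standard notation (an assumption here, not a quotation from the paper):

  \mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]

  \mathcal{L}_{\mathrm{SimPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\frac{\beta}{|y_w|}\log\pi_\theta(y_w\mid x) - \frac{\beta}{|y_l|}\log\pi_\theta(y_l\mid x) - \gamma\right)\right]

Dropping \pi_{\mathrm{ref}} removes the need to keep a reference model in memory during training, which is the main practical simplification SimPO advertises.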

Smaug: Fixing failure modes of preference optimisation with DPO-Positive

A Pal, D Karkhanis, S Dooley, M Roberts… - arXiv preprint arXiv …, 2024 - arxiv.org
Direct Preference Optimisation (DPO) is effective at significantly improving the performance
of large language models (LLMs) on downstream tasks such as reasoning, summarisation …
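DPO-Positive (DPOP) targets a failure mode in which standard DPO can reduce the likelihood of the preferred completion as long as its margin over the rejected one keeps growing. A sketch of the penalized objective, with λ an assumed name for the penalty weight:

  \mathcal{L}_{\mathrm{DPOP}}(\theta) = -\,\mathbb{E}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)} - \lambda\,\max\!\left(0,\ \log\frac{\pi_{\mathrm{ref}}(y_w\mid x)}{\pi_\theta(y_w\mid x)}\right)\right)\right]

The max term is zero while the policy assigns the preferred completion at least reference-level probability, and otherwise penalizes the drop.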

Iterative reasoning preference optimization

RY Pang, W Yuan, K Cho, H He, S Sukhbaatar… - arXiv preprint arXiv …, 2024 - arxiv.org
Iterative preference optimization methods have recently been shown to perform well for
general instruction tuning tasks, but typically make little improvement on reasoning tasks …
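The iterative recipe alternates between sampling chain-of-thought candidates, forming preference pairs from answer correctness (correct solutions as y_w, incorrect ones as y_l), and preference training, with each round initialized from the previous round's model. The training loss reported combines DPO with a negative log-likelihood term on the preferred solution; a sketch, where α is an assumed name for the NLL weight:

  \mathcal{L}(\theta) = \mathcal{L}_{\mathrm{DPO}}(x, y_w, y_l; \theta) + \alpha\,\mathcal{L}_{\mathrm{NLL}}(x, y_w; \theta), \qquad \mathcal{L}_{\mathrm{NLL}}(x, y_w; \theta) = -\log\pi_\theta(y_w\mid x)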

Large language models meet NLP: A survey

L Qin, Q Chen, X Feng, Y Wu, Y Zhang, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
While large language models (LLMs) like ChatGPT have shown impressive capabilities in
Natural Language Processing (NLP) tasks, a systematic investigation of their potential in this …

Improving machine translation with human feedback: An exploration of quality estimation as a reward model

Z He, X Wang, W Jiao, Z Zhang, R Wang, S Shi… - arXiv preprint arXiv …, 2024 - arxiv.org
Insufficient modeling of human preferences within the reward model is a major obstacle for
leveraging human feedback to improve translation quality. Fortunately, quality estimation …
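The underlying idea is to use a quality estimation (QE) model, which scores a translation without needing a reference, in place of a preference-trained reward model. A generic sketch of the KL-regularized objective this slots into (the paper's exact formulation is not reproduced here):

  \max_\theta\ \mathbb{E}_{x\sim\mathcal{D},\ y\sim\pi_\theta(\cdot\mid x)}\left[r_{\mathrm{QE}}(x, y)\right] - \beta\,\mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta(y\mid x)\ \|\ \pi_{\mathrm{ref}}(y\mid x)\right]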

Amuro & Char: Analyzing the relationship between pre-training and fine-tuning of large language models

K Sun, M Dredze - arXiv preprint arXiv:2408.06663, 2024 - arxiv.org
The development of large language models has led to a pre-train-then-align paradigm, in which the model is typically pre-trained on a large text corpus and undergoes a …

To prompt or not to prompt: Navigating the use of large language models for integrating and modeling heterogeneous data

A Remadi, K El Hage, Y Hobeika, F Bugiotti - Data & Knowledge …, 2024 - Elsevier
Manually integrating data of diverse formats and languages is vital to many artificial
intelligence applications. However, the task itself remains challenging and time-consuming …

Towards analyzing and understanding the limitations of DPO: A theoretical perspective

D Feng, B Qin, C Huang, Z Zhang, W Lei - arXiv preprint arXiv:2404.04626, 2024 - arxiv.org
Direct Preference Optimization (DPO), which derives reward signals directly from pairwise
preference data, has shown its effectiveness on aligning Large Language Models (LLMs) …

Preference tuning for toxicity mitigation generalizes across languages

X Li, ZX Yong, SH Bach - arXiv preprint arXiv:2406.16235, 2024 - arxiv.org
Detoxifying multilingual Large Language Models (LLMs) has become crucial due to their
increasing global use. In this work, we explore zero-shot cross-lingual generalization of …

From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large Language Models

Q He, J Zeng, Q He, J Liang, Y Xiao - arXiv preprint arXiv:2404.15846, 2024 - arxiv.org
It is imperative for large language models (LLMs) to follow instructions with elaborate
requirements (i.e., complex instruction following). Yet, it remains under-explored how to …