SimPO: Simple preference optimization with a reference-free reward

Y Meng, M Xia, D Chen - arXiv preprint arXiv:2405.14734, 2024 - arxiv.org
Direct Preference Optimization (DPO) is a widely used offline preference optimization
algorithm that reparameterizes reward functions in reinforcement learning from human …
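For context, DPO trains on a pairwise logistic loss over an implicit reward defined against a reference policy, while SimPO replaces that reward with a length-normalized, reference-free log-probability plus a target margin γ. A sketch of the two objectives in standard notation (an assumption here, not a quotation from the paper):

  \mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]

  \mathcal{L}_{\mathrm{SimPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\frac{\beta}{|y_w|}\log\pi_\theta(y_w\mid x) - \frac{\beta}{|y_l|}\log\pi_\theta(y_l\mid x) - \gamma\right)\right]

Dropping \pi_{\mathrm{ref}} removes the need to keep a reference model in memory during training, which is the main practical simplification SimPO advertises.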

Smaug: Fixing failure modes of preference optimisation with DPO-Positive

A Pal, D Karkhanis, S Dooley, M Roberts… - arXiv preprint arXiv …, 2024 - arxiv.org
Direct Preference Optimisation (DPO) is effective at significantly improving the performance
of large language models (LLMs) on downstream tasks such as reasoning, summarisation …
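DPO-Positive (DPOP) targets a failure mode in which standard DPO can reduce the likelihood of the preferred completion as long as its margin over the rejected one keeps growing. A sketch of the penalized objective, with λ an assumed name for the penalty weight:

  \mathcal{L}_{\mathrm{DPOP}}(\theta) = -\,\mathbb{E}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)} - \lambda\,\max\!\left(0,\ \log\frac{\pi_{\mathrm{ref}}(y_w\mid x)}{\pi_\theta(y_w\mid x)}\right)\right)\right]

The max term is zero while the policy assigns the preferred completion at least reference-level probability, and otherwise penalizes the drop.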

Iterative reasoning preference optimization

RY Pang, W Yuan, K Cho, H He, S Sukhbaatar… - arXiv preprint arXiv …, 2024 - arxiv.org
Iterative preference optimization methods have recently been shown to perform well for
general instruction tuning tasks, but typically make little improvement on reasoning tasks …
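The iterative recipe alternates between sampling chain-of-thought candidates, forming preference pairs from answer correctness (correct solutions as y_w, incorrect ones as y_l), and preference training, with each round initialized from the previous round's model. The training loss reported combines DPO with a negative log-likelihood term on the preferred solution; a sketch, where α is an assumed name for the NLL weight:

  \mathcal{L}(\theta) = \mathcal{L}_{\mathrm{DPO}}(x, y_w, y_l; \theta) + \alpha\,\mathcal{L}_{\mathrm{NLL}}(x, y_w; \theta), \qquad \mathcal{L}_{\mathrm{NLL}}(x, y_w; \theta) = -\log\pi_\theta(y_w\mid x)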

Large language models meet NLP: A survey

L Qin, Q Chen, X Feng, Y Wu, Y Zhang, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
While large language models (LLMs) like ChatGPT have shown impressive capabilities in
Natural Language Processing (NLP) tasks, a systematic investigation of their potential in this …

Improving machine translation with human feedback: An exploration of quality estimation as a reward model

Z He, X Wang, W Jiao, Z Zhang, R Wang, S Shi… - arXiv preprint arXiv …, 2024 - arxiv.org
Insufficient modeling of human preferences within the reward model is a major obstacle for
leveraging human feedback to improve translation quality. Fortunately, quality estimation …
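The underlying idea is to use a quality estimation (QE) model, which scores a translation without needing a reference, in place of a preference-trained reward model. A generic sketch of the KL-regularized objective this slots into (the paper's exact formulation is not reproduced here):

  \max_\theta\ \mathbb{E}_{x\sim\mathcal{D},\ y\sim\pi_\theta(\cdot\mid x)}\left[r_{\mathrm{QE}}(x, y)\right] - \beta\,\mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta(y\mid x)\ \|\ \pi_{\mathrm{ref}}(y\mid x)\right]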

Amuro & Char: Analyzing the relationship between pre-training and fine-tuning of large language models

K Sun, M Dredze - arXiv preprint arXiv:2408.06663, 2024 - arxiv.org
The development of large language models has led to a pre-train-then-align paradigm, in which the model is typically pre-trained on a large text corpus and undergoes a …

To prompt or not to prompt: Navigating the use of large language models for integrating and modeling heterogeneous data

A Remadi, K El Hage, Y Hobeika, F Bugiotti - Data & Knowledge …, 2024 - Elsevier
Manually integrating data of diverse formats and languages is vital to many artificial
intelligence applications. However, the task itself remains challenging and time-consuming …

Towards analyzing and understanding the limitations of DPO: A theoretical perspective

D Feng, B Qin, C Huang, Z Zhang, W Lei - arXiv preprint arXiv:2404.04626, 2024 - arxiv.org
Direct Preference Optimization (DPO), which derives reward signals directly from pairwise
preference data, has shown its effectiveness on aligning Large Language Models (LLMs) …

Preference tuning for toxicity mitigation generalizes across languages

X Li, ZX Yong, SH Bach - arXiv preprint arXiv:2406.16235, 2024 - arxiv.org
Detoxifying multilingual Large Language Models (LLMs) has become crucial due to their
increasing global use. In this work, we explore zero-shot cross-lingual generalization of …

From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large Language Models

Q He, J Zeng, Q He, J Liang, Y Xiao - arXiv preprint arXiv:2404.15846, 2024 - arxiv.org
It is imperative for large language models (LLMs) to follow instructions with elaborate
requirements (i.e., complex instruction following). Yet, it remains under-explored how to …