Self-rewarding language models

W Yuan, RY Pang, K Cho, S Sukhbaatar, J Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
We posit that to achieve superhuman agents, future models require superhuman feedback
in order to provide an adequate training signal. Current approaches commonly train reward …

A comprehensive survey of datasets, theories, variants, and applications in direct preference optimization

W Xiao, Z Wang, L Gan, S Zhao, W He, LA Tuan… - arXiv preprint arXiv …, 2024 - arxiv.org
With the rapid advancement of large language models (LLMs), aligning policy models with
human preferences has become increasingly critical. Direct Preference Optimization (DPO) …

Strengthening multimodal large language model with bootstrapped preference optimization

R Pi, T Han, W Xiong, J Zhang, R Liu, R Pan… - … on Computer Vision, 2025 - Springer
Multimodal Large Language Models (MLLMs) excel in generating responses based
on visual inputs. However, they often suffer from a bias towards generating responses …

MLLM-Protector: Ensuring MLLM's safety without hurting performance

R Pi, T Han, J Zhang, Y Xie, R Pan, Q Lian… - arXiv preprint arXiv …, 2024 - arxiv.org
The deployment of multimodal large language models (MLLMs) has brought forth a unique
vulnerability: susceptibility to malicious attacks through visual inputs. This paper investigates …

Direct Nash optimization: Teaching language models to self-improve with general preferences

C Rosset, CA Cheng, A Mitra, M Santacroce… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper studies post-training large language models (LLMs) using preference feedback
from a powerful oracle to help a model iteratively improve over itself. The typical approach …

A survey of reinforcement learning from human feedback

T Kaufmann, P Weng, V Bengs… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …

Direct language model alignment from online AI feedback

S Guo, B Zhang, T Liu, T Liu, M Khalman… - arXiv preprint arXiv …, 2024 - arxiv.org
Direct alignment from preferences (DAP) methods, such as DPO, have recently emerged as
efficient alternatives to reinforcement learning from human feedback (RLHF), which do not …

DPO meets PPO: Reinforced token optimization for RLHF

H Zhong, G Feng, W Xiong, X Cheng, L Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal
Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards--a …

Self-exploring language models: Active preference elicitation for online alignment

S Zhang, D Yu, H Sharma, H Zhong, Z Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Preference optimization, particularly through Reinforcement Learning from Human
Feedback (RLHF), has achieved significant success in aligning Large Language Models …

Mitigating the alignment tax of RLHF

Y Lin, H Lin, W Xiong, S Diao, J Liu… - Proceedings of the …, 2024 - aclanthology.org
LLMs acquire a wide range of abilities during pre-training, but aligning LLMs under
Reinforcement Learning with Human Feedback (RLHF) can lead to forgetting pretrained …