Controlled decoding from language models

S Mudgal, J Lee, H Ganapathy, YG Li, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
We propose controlled decoding (CD), a novel off-policy reinforcement learning method to
control the autoregressive generation from language models towards high reward …
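
As a rough illustration of reward-guided decoding (not necessarily CD's exact formulation), the sketch below re-ranks the base model's top-k next tokens with a learned prefix value estimate; `base_lm`, `prefix_scorer`, and the blending weight `beta` are assumed interfaces for illustration.

```python
import torch

@torch.no_grad()
def controlled_decode_step(base_lm, prefix_scorer, input_ids, beta=1.0, top_k=20):
    """One decoding step that re-scores the base LM's top-k candidate tokens with a
    learned estimate of the expected reward of the extended prefix.
    `prefix_scorer(ids) -> [B]` is a hypothetical value-model interface."""
    logits = base_lm(input_ids).logits[:, -1, :]            # [B, V] next-token logits
    topk_logits, topk_ids = logits.topk(top_k, dim=-1)      # restrict to top-k candidates
    values = []
    for k in range(top_k):                                  # value of each candidate prefix
        cand = torch.cat([input_ids, topk_ids[:, k:k + 1]], dim=-1)
        values.append(prefix_scorer(cand))
    values = torch.stack(values, dim=-1)                    # [B, top_k]
    blended = topk_logits + beta * values                   # reward-guided re-scoring
    next_tok = topk_ids.gather(-1, blended.argmax(dim=-1, keepdim=True))
    return torch.cat([input_ids, next_tok], dim=-1)
```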

Uncertainty-aware reward model: Teaching reward models to know what is unknown

X Lou, D Yan, W Shen, Y Yan, J Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
Reward models (RM) play a critical role in aligning generations of large language models
(LLM) to human expectations. However, prevailing RMs fail to capture the stochasticity …
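
One generic way to make a reward model uncertainty-aware is to predict a per-response mean and variance and shrink the preference margin when the predictive variance is large; the head and loss below sketch that assumption, not this paper's specific architecture or objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UncertaintyRewardHead(nn.Module):
    """Maps a pooled LM hidden state to a reward mean and log-variance
    (generic sketch; the paper's parameterization may differ)."""
    def __init__(self, hidden_size):
        super().__init__()
        self.mean = nn.Linear(hidden_size, 1)
        self.log_var = nn.Linear(hidden_size, 1)

    def forward(self, pooled):
        return self.mean(pooled).squeeze(-1), self.log_var(pooled).squeeze(-1)

def uncertainty_weighted_bt_loss(mu_c, lv_c, mu_r, lv_r):
    """Bradley-Terry preference loss on reward means, with the margin shrunk when
    the combined predictive variance is large (assumed form, for illustration)."""
    var = lv_c.exp() + lv_r.exp()
    margin = (mu_c - mu_r) / (1.0 + var).sqrt()
    return -F.logsigmoid(margin).mean()
```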

Filtered direct preference optimization

T Morimura, M Sakamoto, Y Jinnai, K Abe… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) plays a crucial role in aligning
language models with human preferences. While the significance of dataset quality is …
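
A minimal sketch of reward-model-based filtering before DPO training, assuming a rule that keeps a preference pair only if its chosen response outscores a fresh sample from the current policy; `reward_model` and `policy_sample` are hypothetical interfaces and the paper's actual criterion may differ.

```python
def filter_preference_pairs(pairs, reward_model, policy_sample, min_margin=0.0):
    """Keep only pairs whose chosen response scores at least `min_margin` above a
    response freshly sampled from the current policy; DPO is then run on the kept
    pairs. reward_model(prompt, response) -> float and policy_sample(prompt) -> str
    are assumed callables."""
    kept = []
    for prompt, chosen, rejected in pairs:
        r_chosen = reward_model(prompt, chosen)
        r_policy = reward_model(prompt, policy_sample(prompt))
        if r_chosen - r_policy >= min_margin:
            kept.append((prompt, chosen, rejected))
    return kept
```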

Conditional Language Policy: A General Framework for Steerable Multi-Objective Finetuning

K Wang, R Kidambi, R Sullivan, A Agarwal… - arXiv preprint arXiv …, 2024 - arxiv.org
Reward-based finetuning is crucial for aligning language policies with intended behaviors
(e.g., creativity and safety). A key challenge is to develop steerable language models that …
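
A common recipe for steerable multi-objective finetuning is to sample a weighting over the objectives, encode it in the prompt, and train against the correspondingly scalarized reward; the sketch below illustrates that recipe under those assumptions rather than the paper's algorithm.

```python
import random

def scalarize(rewards, weights):
    """Linear scalarization of multiple objective rewards (e.g. creativity, safety)."""
    return sum(weights[k] * rewards[k] for k in rewards)

def sample_conditioning(prompt, objectives=("creativity", "safety")):
    """Draw a random preference weighting and encode it in the prompt so a single
    policy can be steered at inference time (illustrative encoding, not the paper's)."""
    raw = [random.random() + 1e-8 for _ in objectives]
    total = sum(raw)
    weights = {k: v / total for k, v in zip(objectives, raw)}
    tag = " ".join(f"<{k}={weights[k]:.2f}>" for k in objectives)
    return f"{tag} {prompt}", weights
```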

The perfect blend: Redefining RLHF with mixture of judges

T Xu, E Helenowski, KA Sankararaman, D Jin… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) has become the leading approach for
fine-tuning large language models (LLM). However, RLHF has limitations in multi-task …
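
One simple way to combine a task reward with a mixture of constraint judges is to keep the reward only when every judge approves the response; the function below is that gating sketch, with `judges` as hypothetical (prompt, response) -> bool callables, not the paper's exact combination rule.

```python
def gated_reward(prompt, response, task_reward, judges, penalty=-1.0):
    """Combine a task reward with a set of constraint judges: the reward is kept only
    if every judge approves the response, otherwise it is replaced by a penalty
    (one simple gating scheme; the paper's combination rule may differ)."""
    if all(judge(prompt, response) for judge in judges):
        return task_reward
    return penalty
```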

Theoretical guarantees on the best-of-n alignment policy

A Beirami, A Agarwal, J Berant, A D'Amour… - arXiv preprint arXiv …, 2024 - arxiv.org
A simple and effective method for the alignment of generative models is the best-of-$n$
policy, where $n$ samples are drawn from a base policy, and ranked based on a reward …
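
Best-of-$n$ itself is easy to state in code: draw $n$ samples and keep the highest-reward one. The helper also records the commonly cited analytical estimate $\log n - (n-1)/n$ for the KL divergence to the base policy, which is the kind of expression this paper scrutinizes; `sample` and `reward` are assumed callables.

```python
import math

def best_of_n(prompt, sample, reward, n=8):
    """Draw n candidates from the base policy and keep the one the reward model
    ranks highest; sample(prompt) -> str and reward(prompt, y) -> float are
    assumed callables."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward(prompt, y))

def kl_best_of_n_folklore(n):
    """Commonly cited analytical estimate of KL(best-of-n || base policy),
    log(n) - (n - 1)/n; the paper examines how accurate this expression is."""
    return math.log(n) - (n - 1) / n
```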

Anchored preference optimization and contrastive revisions: Addressing underspecification in alignment

K D'Oosterlinck, W Xu, C Develder… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) are often aligned using contrastive alignment objectives
and preference pair datasets. The interaction between model, paired data, and objective …
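
To make "contrastive" preference pairs concrete, one revision-based construction pairs a model draft with a minimally edited improvement of it; the helper below sketches that idea, with `reviser` as a hypothetical stronger-model interface rather than the paper's exact pipeline.

```python
def contrastive_revision_pair(prompt, draft, reviser):
    """Build a preference pair by minimally revising the model's own draft: the
    revision becomes the chosen response and the original draft the rejected one,
    so the pair differs mainly along the intended improvement (simplified reading).
    `reviser(prompt, draft) -> str` is a hypothetical stronger-model interface."""
    improved = reviser(prompt, draft)
    return {"prompt": prompt, "chosen": improved, "rejected": draft}
```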

Robust preference optimization through reward model distillation

A Fisch, J Eisenstein, V Zayats, A Agarwal… - arXiv preprint arXiv …, 2024 - arxiv.org
Language model (LM) post-training (or alignment) involves maximizing a reward function
that is derived from preference annotations. Direct Preference Optimization (DPO) is a …
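
One natural distillation objective matches the pairwise preference distribution implied by the policy's implicit rewards (the scaled policy/reference log-ratio used by DPO) to the distribution implied by an explicit reward model; the loss below assumes that formulation and is not necessarily the paper's exact objective.

```python
import torch.nn.functional as F

def reward_distillation_loss(policy_logps, ref_logps, rm_scores, beta=0.1):
    """Match the pairwise preference distribution implied by the policy's implicit
    rewards, beta * (log pi - log pi_ref), to the one implied by an explicit reward
    model. All inputs have shape [B, 2] for (chosen, rejected) responses per prompt."""
    implicit = beta * (policy_logps - ref_logps)     # [B, 2] implicit rewards
    log_p_policy = F.log_softmax(implicit, dim=-1)   # policy-implied preference (log-probs)
    p_rm = F.softmax(rm_scores, dim=-1)              # RM-implied preference (probs)
    return F.kl_div(log_p_policy, p_rm, reduction="batchmean")  # KL(p_rm || p_policy)
```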

Provably mitigating overoptimization in RLHF: Your SFT loss is implicitly an adversarial regularizer

Z Liu, M Lu, S Zhang, B Liu, H Guo, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Aligning generative models with human preference via RLHF typically suffers from
overoptimization, where an imperfectly learned reward model can misguide the generative …
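
Read literally, the title suggests adding an SFT (maximum-likelihood) term on the preferred responses to the preference loss as a regularizer against overoptimization; the combined objective below is a minimal sketch of that recipe, with `eta` an assumed weighting.

```python
import torch.nn.functional as F

def preference_plus_sft_loss(policy_logp_c, policy_logp_r,
                             ref_logp_c, ref_logp_r,
                             sft_nll_chosen, beta=0.1, eta=1.0):
    """DPO-style preference loss plus an SFT negative log-likelihood term on the
    chosen responses; the SFT term anchors the policy to the data and acts as a
    regularizer against an imperfectly learned reward signal (eta is assumed)."""
    margin = beta * ((policy_logp_c - ref_logp_c) - (policy_logp_r - ref_logp_r))
    preference = -F.logsigmoid(margin).mean()
    return preference + eta * sft_nll_chosen.mean()
```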

Overcoming reward overoptimization via adversarial policy optimization with lightweight uncertainty estimation

X Zhang, JF Ton, W Shen, H Wang, Y Liu - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Adversarial Policy Optimization (AdvPO), a novel solution to the pervasive
issue of reward over-optimization in Reinforcement Learning from Human Feedback (RLHF) …
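
Lightweight reward uncertainty is often derived from last-layer reward-model features via a Bayesian linear head, and the policy is then optimized against a pessimistic (lower-confidence-bound) reward; the sketch below implements that pessimism rule under those assumptions, not necessarily AdvPO's exact estimator.

```python
import torch

def pessimistic_reward(features, w_mean, w_cov, alpha=1.0):
    """Lower-confidence-bound reward from a Bayesian linear head over last-layer
    reward-model features: mean reward minus a scaled predictive standard deviation.
    features: [B, D], w_mean: [D], w_cov: [D, D] (assumed posterior over head weights)."""
    mean = features @ w_mean                                      # [B] reward mean
    var = torch.einsum("bd,de,be->b", features, w_cov, features)  # [B] predictive variance
    return mean - alpha * var.clamp_min(0.0).sqrt()               # pessimistic estimate
```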