Controlled decoding from language models

S Mudgal, J Lee, H Ganapathy, YG Li, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
We propose controlled decoding (CD), a novel off-policy reinforcement learning method to
control the autoregressive generation from language models towards high reward …
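
As a rough illustration of reward-guided decoding (not necessarily CD's exact formulation), the sketch below re-ranks the base model's top-k next tokens with a learned prefix value estimate; `base_lm`, `prefix_scorer`, and the blending weight `beta` are assumed interfaces for illustration.

```python
import torch

@torch.no_grad()
def controlled_decode_step(base_lm, prefix_scorer, input_ids, beta=1.0, top_k=20):
    """One decoding step that re-scores the base LM's top-k candidate tokens with a
    learned estimate of the expected reward of the extended prefix.
    `prefix_scorer(ids) -> [B]` is a hypothetical value-model interface."""
    logits = base_lm(input_ids).logits[:, -1, :]            # [B, V] next-token logits
    topk_logits, topk_ids = logits.topk(top_k, dim=-1)      # restrict to top-k candidates
    values = []
    for k in range(top_k):                                  # value of each candidate prefix
        cand = torch.cat([input_ids, topk_ids[:, k:k + 1]], dim=-1)
        values.append(prefix_scorer(cand))
    values = torch.stack(values, dim=-1)                    # [B, top_k]
    blended = topk_logits + beta * values                   # reward-guided re-scoring
    next_tok = topk_ids.gather(-1, blended.argmax(dim=-1, keepdim=True))
    return torch.cat([input_ids, next_tok], dim=-1)
```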

Uncertainty-aware reward model: Teaching reward models to know what is unknown

X Lou, D Yan, W Shen, Y Yan, J Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
Reward models (RM) play a critical role in aligning generations of large language models
(LLM) to human expectations. However, prevailing RMs fail to capture the stochasticity …
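
One generic way to make a reward model uncertainty-aware is to predict a per-response mean and variance and shrink the preference margin when the predictive variance is large; the head and loss below sketch that assumption, not this paper's specific architecture or objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UncertaintyRewardHead(nn.Module):
    """Maps a pooled LM hidden state to a reward mean and log-variance
    (generic sketch; the paper's parameterization may differ)."""
    def __init__(self, hidden_size):
        super().__init__()
        self.mean = nn.Linear(hidden_size, 1)
        self.log_var = nn.Linear(hidden_size, 1)

    def forward(self, pooled):
        return self.mean(pooled).squeeze(-1), self.log_var(pooled).squeeze(-1)

def uncertainty_weighted_bt_loss(mu_c, lv_c, mu_r, lv_r):
    """Bradley-Terry preference loss on reward means, with the margin shrunk when
    the combined predictive variance is large (assumed form, for illustration)."""
    var = lv_c.exp() + lv_r.exp()
    margin = (mu_c - mu_r) / (1.0 + var).sqrt()
    return -F.logsigmoid(margin).mean()
```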

Filtered direct preference optimization

T Morimura, M Sakamoto, Y Jinnai, K Abe… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) plays a crucial role in aligning
language models with human preferences. While the significance of dataset quality is …
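
A minimal sketch of reward-model-based filtering before DPO training, assuming a rule that keeps a preference pair only if its chosen response outscores a fresh sample from the current policy; `reward_model` and `policy_sample` are hypothetical interfaces and the paper's actual criterion may differ.

```python
def filter_preference_pairs(pairs, reward_model, policy_sample, min_margin=0.0):
    """Keep only pairs whose chosen response scores at least `min_margin` above a
    response freshly sampled from the current policy; DPO is then run on the kept
    pairs. reward_model(prompt, response) -> float and policy_sample(prompt) -> str
    are assumed callables."""
    kept = []
    for prompt, chosen, rejected in pairs:
        r_chosen = reward_model(prompt, chosen)
        r_policy = reward_model(prompt, policy_sample(prompt))
        if r_chosen - r_policy >= min_margin:
            kept.append((prompt, chosen, rejected))
    return kept
```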

Conditional Language Policy: A General Framework for Steerable Multi-Objective Finetuning

K Wang, R Kidambi, R Sullivan, A Agarwal… - arXiv preprint arXiv …, 2024 - arxiv.org
Reward-based finetuning is crucial for aligning language policies with intended behaviors
(e.g., creativity and safety). A key challenge is to develop steerable language models that …
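
A common recipe for steerable multi-objective finetuning is to sample a weighting over the objectives, encode it in the prompt, and train against the correspondingly scalarized reward; the sketch below illustrates that recipe under those assumptions rather than the paper's algorithm.

```python
import random

def scalarize(rewards, weights):
    """Linear scalarization of multiple objective rewards (e.g. creativity, safety)."""
    return sum(weights[k] * rewards[k] for k in rewards)

def sample_conditioning(prompt, objectives=("creativity", "safety")):
    """Draw a random preference weighting and encode it in the prompt so a single
    policy can be steered at inference time (illustrative encoding, not the paper's)."""
    raw = [random.random() + 1e-8 for _ in objectives]
    total = sum(raw)
    weights = {k: v / total for k, v in zip(objectives, raw)}
    tag = " ".join(f"<{k}={weights[k]:.2f}>" for k in objectives)
    return f"{tag} {prompt}", weights
```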

The perfect blend: Redefining RLHF with mixture of judges

T Xu, E Helenowski, KA Sankararaman, D Jin… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) has become the leading approach for
fine-tuning large language models (LLM). However, RLHF has limitations in multi-task …
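
One simple way to combine a task reward with a mixture of constraint judges is to keep the reward only when every judge approves the response; the function below is that gating sketch, with `judges` as hypothetical (prompt, response) -> bool callables, not the paper's exact combination rule.

```python
def gated_reward(prompt, response, task_reward, judges, penalty=-1.0):
    """Combine a task reward with a set of constraint judges: the reward is kept only
    if every judge approves the response, otherwise it is replaced by a penalty
    (one simple gating scheme; the paper's combination rule may differ)."""
    if all(judge(prompt, response) for judge in judges):
        return task_reward
    return penalty
```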

Theoretical guarantees on the best-of-n alignment policy

A Beirami, A Agarwal, J Berant, A D'Amour… - arXiv preprint arXiv …, 2024 - arxiv.org
A simple and effective method for the alignment of generative models is the best-of-$n$
policy, where $n$ samples are drawn from a base policy, and ranked based on a reward …
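
Best-of-$n$ itself is easy to state in code: draw $n$ samples and keep the highest-reward one. The helper also records the commonly cited analytical estimate $\log n - (n-1)/n$ for the KL divergence to the base policy, which is the kind of expression this paper scrutinizes; `sample` and `reward` are assumed callables.

```python
import math

def best_of_n(prompt, sample, reward, n=8):
    """Draw n candidates from the base policy and keep the one the reward model
    ranks highest; sample(prompt) -> str and reward(prompt, y) -> float are
    assumed callables."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward(prompt, y))

def kl_best_of_n_folklore(n):
    """Commonly cited analytical estimate of KL(best-of-n || base policy),
    log(n) - (n - 1)/n; the paper examines how accurate this expression is."""
    return math.log(n) - (n - 1) / n
```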

Anchored preference optimization and contrastive revisions: Addressing underspecification in alignment

K D'Oosterlinck, W Xu, C Develder… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) are often aligned using contrastive alignment objectives
and preference pair datasets. The interaction between model, paired data, and objective …
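
To make "contrastive" preference pairs concrete, one revision-based construction pairs a model draft with a minimally edited improvement of it; the helper below sketches that idea, with `reviser` as a hypothetical stronger-model interface rather than the paper's exact pipeline.

```python
def contrastive_revision_pair(prompt, draft, reviser):
    """Build a preference pair by minimally revising the model's own draft: the
    revision becomes the chosen response and the original draft the rejected one,
    so the pair differs mainly along the intended improvement (simplified reading).
    `reviser(prompt, draft) -> str` is a hypothetical stronger-model interface."""
    improved = reviser(prompt, draft)
    return {"prompt": prompt, "chosen": improved, "rejected": draft}
```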

Robust preference optimization through reward model distillation

A Fisch, J Eisenstein, V Zayats, A Agarwal… - arXiv preprint arXiv …, 2024 - arxiv.org
Language model (LM) post-training (or alignment) involves maximizing a reward function
that is derived from preference annotations. Direct Preference Optimization (DPO) is a …
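
One natural distillation objective matches the pairwise preference distribution implied by the policy's implicit rewards (the scaled policy/reference log-ratio used by DPO) to the distribution implied by an explicit reward model; the loss below assumes that formulation and is not necessarily the paper's exact objective.

```python
import torch.nn.functional as F

def reward_distillation_loss(policy_logps, ref_logps, rm_scores, beta=0.1):
    """Match the pairwise preference distribution implied by the policy's implicit
    rewards, beta * (log pi - log pi_ref), to the one implied by an explicit reward
    model. All inputs have shape [B, 2] for (chosen, rejected) responses per prompt."""
    implicit = beta * (policy_logps - ref_logps)     # [B, 2] implicit rewards
    log_p_policy = F.log_softmax(implicit, dim=-1)   # policy-implied preference (log-probs)
    p_rm = F.softmax(rm_scores, dim=-1)              # RM-implied preference (probs)
    return F.kl_div(log_p_policy, p_rm, reduction="batchmean")  # KL(p_rm || p_policy)
```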

Provably mitigating overoptimization in RLHF: Your SFT loss is implicitly an adversarial regularizer

Z Liu, M Lu, S Zhang, B Liu, H Guo, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Aligning generative models with human preference via RLHF typically suffers from
overoptimization, where an imperfectly learned reward model can misguide the generative …
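
Read literally, the title suggests adding an SFT (maximum-likelihood) term on the preferred responses to the preference loss as a regularizer against overoptimization; the combined objective below is a minimal sketch of that recipe, with `eta` an assumed weighting.

```python
import torch.nn.functional as F

def preference_plus_sft_loss(policy_logp_c, policy_logp_r,
                             ref_logp_c, ref_logp_r,
                             sft_nll_chosen, beta=0.1, eta=1.0):
    """DPO-style preference loss plus an SFT negative log-likelihood term on the
    chosen responses; the SFT term anchors the policy to the data and acts as a
    regularizer against an imperfectly learned reward signal (eta is assumed)."""
    margin = beta * ((policy_logp_c - ref_logp_c) - (policy_logp_r - ref_logp_r))
    preference = -F.logsigmoid(margin).mean()
    return preference + eta * sft_nll_chosen.mean()
```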

Overcoming reward overoptimization via adversarial policy optimization with lightweight uncertainty estimation

X Zhang, JF Ton, W Shen, H Wang, Y Liu - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Adversarial Policy Optimization (AdvPO), a novel solution to the pervasive
issue of reward over-optimization in Reinforcement Learning from Human Feedback (RLHF) …
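
Lightweight reward uncertainty is often derived from last-layer reward-model features via a Bayesian linear head, and the policy is then optimized against a pessimistic (lower-confidence-bound) reward; the sketch below implements that pessimism rule under those assumptions, not necessarily AdvPO's exact estimator.

```python
import torch

def pessimistic_reward(features, w_mean, w_cov, alpha=1.0):
    """Lower-confidence-bound reward from a Bayesian linear head over last-layer
    reward-model features: mean reward minus a scaled predictive standard deviation.
    features: [B, D], w_mean: [D], w_cov: [D, D] (assumed posterior over head weights)."""
    mean = features @ w_mean                                      # [B] reward mean
    var = torch.einsum("bd,de,be->b", features, w_cov, features)  # [B] predictive variance
    return mean - alpha * var.clamp_min(0.0).sqrt()               # pessimistic estimate
```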