Direct preference optimization: Your language model is secretly a reward model

R Rafailov, A Sharma, E Mitchell… - Advances in …, 2024 - proceedings.neurips.cc
While large-scale unsupervised language models (LMs) learn broad world knowledge and
some reasoning skills, achieving precise control of their behavior is difficult due to the …
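
The DPO objective this entry refers to is a pairwise logistic loss on policy-vs-reference log-probability ratios; a minimal PyTorch-style sketch under that reading, where the tensor names and the beta value are illustrative placeholders rather than anything taken from the paper's code:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Each input: 1-D tensor of summed token log-probs log pi(y|x) for the
    # chosen / rejected completion of each preference pair.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * (chosen margin minus rejected margin, relative to
    # the frozen reference model)); averaged over the batch of pairs.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()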

RoboCLIP: One demonstration is enough to learn robot policies

S Sontakke, J Zhang, S Arnold… - Advances in …, 2024 - proceedings.neurips.cc
Reward specification is a notoriously difficult problem in reinforcement learning, requiring
extensive expert supervision to design robust reward functions. Imitation learning (IL) …

Inverse preference learning: Preference-based RL without a reward function

J Hejna, D Sadigh - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Reward functions are difficult to design and often hard to align with human intent. Preference-based
Reinforcement Learning (RL) algorithms address these problems by learning reward …

Learning from active human involvement through proxy value propagation

ZM Peng, W Mo, C Duan, Q Li… - Advances in neural …, 2024 - proceedings.neurips.cc
Learning from active human involvement enables the human subject to actively intervene
and demonstrate to the AI agent during training. The interaction and corrective feedback …

Promptable behaviors: Personalizing multi-objective rewards from human preferences

M Hwang, L Weihs, C Park, K Lee… - Proceedings of the …, 2024 - openaccess.thecvf.com
Customizing robotic behaviors to be aligned with diverse human preferences is an
underexplored challenge in the field of embodied AI. In this paper we present Promptable …

Active preference-based Gaussian process regression for reward learning and optimization

E Bıyık, N Huynh, MJ Kochenderfer… - … Journal of Robotics …, 2024 - journals.sagepub.com
Designing reward functions is a difficult task in AI and robotics. The complex task of directly
specifying all the desirable behaviors a robot needs to optimize often proves challenging for …

Feedback loops with language models drive in-context reward hacking

A Pan, E Jones, M Jagadeesan, J Steinhardt - arXiv preprint arXiv …, 2024 - arxiv.org
Language models influence the external world: they query APIs that read and write to web
pages, generate content that shapes human behavior, and run system commands as …

Iterative data smoothing: Mitigating reward overfitting and overoptimization in RLHF

B Zhu, MI Jordan, J Jiao - arXiv preprint arXiv:2401.16335, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique that aligns
language models closely with human-centric values. The initial phase of RLHF involves …

Preventing reward hacking with occupancy measure regularization

C Laidlaw, S Singhal, A Dragan - arXiv preprint arXiv:2403.03185, 2024 - arxiv.org
Reward hacking occurs when an agent performs very well with respect to a "proxy" reward
function (which may be hand-specified or learned), but poorly with respect to the unknown …

Learning optimal advantage from preferences and mistaking it for reward

WB Knox, S Hatgis-Kessell, SO Adalgeirsson… - Proceedings of the …, 2024 - ojs.aaai.org
We consider algorithms for learning reward functions from human preferences over pairs of
trajectory segments, as used in reinforcement learning from human feedback (RLHF). Most …
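
For the segment-pair preference setting this last entry describes, the usual objective is a Bradley-Terry likelihood on summed per-step rewards; a minimal sketch of that generic loss, not of this paper's advantage-versus-reward analysis, with reward_net and the segment tensor shapes assumed for illustration:

import torch
import torch.nn.functional as F

def segment_preference_loss(reward_net, seg_preferred, seg_rejected):
    # seg_preferred / seg_rejected: (batch, horizon, obs_dim) tensors holding
    # the segment the human preferred and the one they did not.
    # reward_net maps a (batch*horizon, obs_dim) batch of states to scalar rewards.
    b, h, d = seg_preferred.shape
    # Sum predicted per-step rewards over each segment.
    r_pref = reward_net(seg_preferred.reshape(b * h, d)).reshape(b, h).sum(dim=1)
    r_rej = reward_net(seg_rejected.reshape(b * h, d)).reshape(b, h).sum(dim=1)
    # Bradley-Terry: P(preferred > rejected) = sigmoid(r_pref - r_rej);
    # minimize the negative log-likelihood of the human labels.
    return -F.logsigmoid(r_pref - r_rej).mean()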