A survey of preference-based reinforcement learning methods

C Wirth, R Akrour, G Neumann, J Fürnkranz - Journal of Machine Learning …, 2017 - jmlr.org
Reinforcement learning (RL) techniques optimize the accumulated long-term reward of a
suitably chosen reward function. However, designing such a reward function often requires …

Distributional preference learning: Understanding and accounting for hidden context in RLHF

A Siththaranjan, C Laidlaw… - arXiv preprint arXiv …, 2023 - arxiv.org
In practice, preference learning from human feedback depends on incomplete data with
hidden context. Hidden context refers to data that affects the feedback received, but which is …

Spectral mle: Top-k rank aggregation from pairwise comparisons

Y Chen, C Suh - International Conference on Machine …, 2015 - proceedings.mlr.press
This paper explores the preference-based top-K rank aggregation problem. Suppose that a
collection of items is repeatedly compared in pairs, and one wishes to recover a consistent …

Preference-based online learning with dueling bandits: A survey

V Bengs, R Busa-Fekete, A El Mesaoudi-Paul… - Journal of Machine …, 2021 - jmlr.org
In machine learning, the notion of multi-armed bandits refers to a class of online learning
problems, in which an agent is supposed to simultaneously explore and exploit a given set …

Learning multimodal rewards from rankings

V Myers, E Biyik, N Anari… - Conference on robot …, 2022 - proceedings.mlr.press
Learning from human feedback has shown to be a useful approach in acquiring robot
reward functions. However, expert feedback is often assumed to be drawn from an …

Efficient and optimal algorithms for contextual dueling bandits under realizability

A Saha, A Krishnamurthy - International Conference on …, 2022 - proceedings.mlr.press
We study the $K$-armed contextual dueling bandit problem, a sequential decision making
setting in which the learner uses contextual information to make two decisions, but only …

Contextual dueling bandits

M Dudík, K Hofmann, RE Schapire… - … on Learning Theory, 2015 - proceedings.mlr.press
We consider the problem of learning to choose actions using contextual information when
provided with limited feedback in the form of relative pairwise comparisons. We study this …

Copeland dueling bandits

M Zoghi, ZS Karnin, S Whiteson… - Advances in neural …, 2015 - proceedings.neurips.cc
A version of the dueling bandit problem is addressed in which a Condorcet winner may not
exist. Two algorithms are proposed that instead seek to minimize regret with respect to the …

Online rank elicitation for plackett-luce: A dueling bandits approach

B Szörényi, R Busa-Fekete, A Paul… - Advances in neural …, 2015 - proceedings.neurips.cc
We study the problem of online rank elicitation, assuming that rankings of a set of
alternatives obey the Plackett-Luce distribution. Following the setting of the dueling bandits …

Advancements in Dueling Bandits.

Y Sui, M Zoghi, K Hofmann, Y Yue - IJCAI, 2018 - ijcai.org
The dueling bandits problem is an online learning framework where learning happens
“on-the-fly” through preference feedback, i.e., from comparisons between a pair of actions. Unlike …