A survey of preference-based reinforcement learning methods

C Wirth, R Akrour, G Neumann, J Fürnkranz - Journal of Machine Learning …, 2017 - jmlr.org

Reinforcement learning (RL) techniques optimize the accumulated long-term reward of a
suitably chosen reward function. However, designing such a reward function often requires …

Cited by 433 · Related articles · All 10 versions

Distributional preference learning: Understanding and accounting for hidden context in RLHF

A Siththaranjan, C Laidlaw… - arXiv preprint arXiv …, 2023 - arxiv.org

In practice, preference learning from human feedback depends on incomplete data with
hidden context. Hidden context refers to data that affects the feedback received, but which is …

Cited by 35 · Related articles · All 3 versions

Spectral MLE: Top-K rank aggregation from pairwise comparisons

Y Chen, C Suh - International Conference on Machine …, 2015 - proceedings.mlr.press

This paper explores the preference-based top-K rank aggregation problem. Suppose that a
collection of items is repeatedly compared in pairs, and one wishes to recover a consistent …

Cited by 183 · Related articles · All 13 versions

Preference-based online learning with dueling bandits: A survey

V Bengs, R Busa-Fekete, A El Mesaoudi-Paul… - Journal of Machine …, 2021 - jmlr.org

In machine learning, the notion of multi-armed bandits refers to a class of online learning
problems, in which an agent is supposed to simultaneously explore and exploit a given set …

Cited by 119 · Related articles · All 7 versions

Learning multimodal rewards from rankings

V Myers, E Biyik, N Anari… - Conference on robot …, 2022 - proceedings.mlr.press

Learning from human feedback has shown to be a useful approach in acquiring robot
reward functions. However, expert feedback is often assumed to be drawn from an …

Cited by 55 · Related articles · All 9 versions

Efficient and optimal algorithms for contextual dueling bandits under realizability

A Saha, A Krishnamurthy - International Conference on …, 2022 - proceedings.mlr.press

We study the $K$-armed contextual dueling bandit problem, a sequential decision making
setting in which the learner uses contextual information to make two decisions, but only …

Cited by 39 · Related articles · All 3 versions

Contextual dueling bandits

M Dudík, K Hofmann, RE Schapire… - … on Learning Theory, 2015 - proceedings.mlr.press

We consider the problem of learning to choose actions using contextual information when
provided with limited feedback in the form of relative pairwise comparisons. We study this …

Cited by 123 · Related articles · All 8 versions

Copeland dueling bandits

M Zoghi, ZS Karnin, S Whiteson… - Advances in neural …, 2015 - proceedings.neurips.cc

A version of the dueling bandit problem is addressed in which a Condorcet winner may not
exist. Two algorithms are proposed that instead seek to minimize regret with respect to the …

Cited by 112 · Related articles · All 13 versions

Online rank elicitation for Plackett-Luce: A dueling bandits approach

B Szörényi, R Busa-Fekete, A Paul… - Advances in neural …, 2015 - proceedings.neurips.cc

We study the problem of online rank elicitation, assuming that rankings of a set of
alternatives obey the Plackett-Luce distribution. Following the setting of the dueling bandits …

Cited by 104 · Related articles · All 13 versions

Advancements in Dueling Bandits

Y Sui, M Zoghi, K Hofmann, Y Yue - IJCAI, 2018 - ijcai.org

The dueling bandits problem is an online learning framework where learning happens
“on-the-fly” through preference feedback, i.e., from comparisons between a pair of actions. Unlike …

Cited by 77 · Related articles · All 5 versions
