Reinforcement learning (RL) techniques optimize the accumulated long-term reward of a suitably chosen reward function. However, designing such a reward function often requires …
Machine translation is a natural candidate problem for reinforcement learning from human feedback: users provide quick, dirty ratings on candidate translations to guide a system to …
Children start to communicate and use language in social interactions from a very young age. This allows them to experiment with their developing linguistic knowledge and receive …
We present a study on reinforcement learning (RL) from human bandit feedback for sequence-to-sequence learning, exemplified by the task of bandit neural machine …
Insufficient modeling of human preferences within the reward model is a major obstacle for leveraging human feedback to improve translation quality. Fortunately, quality estimation …
We propose a method to perform automatic document summarisation without using reference summaries. Instead, our method interactively learns from users' preferences. The …
Z Yao, Y Tang, W Yih, H Sun, Y Su - arXiv preprint arXiv:2005.00689, 2020 - arxiv.org
Despite the widely successful applications, bootstrapping and fine-tuning semantic parsers are still a tedious process with challenges such as costly data annotation and privacy risks …
G Gao, E Choi, Y Artzi - arXiv preprint arXiv:2203.10079, 2022 - arxiv.org
We study learning from user feedback for extractive question answering by simulating feedback using supervised data. We cast the problem as contextual bandit learning, and …
Bandit structured prediction describes a stochastic optimization framework where learning is performed from partial feedback. This feedback is received in the form of a task loss …