Survey on reinforcement learning for language processing

V Uc-Cetina, N Navarro-Guerrero… - Artificial Intelligence …, 2023 - Springer
In recent years some researchers have explored the use of reinforcement learning (RL)
algorithms as key components in the solution of various natural language processing (NLP) …

A survey of preference-based reinforcement learning methods

C Wirth, R Akrour, G Neumann, J Fürnkranz - Journal of Machine Learning …, 2017 - jmlr.org
Reinforcement learning (RL) techniques optimize the accumulated long-term reward of a
suitably chosen reward function. However, designing such a reward function often requires …

Reinforcement learning for bandit neural machine translation with simulated human feedback

K Nguyen, H Daumé III, J Boyd-Graber - arXiv preprint arXiv:1707.07402, 2017 - arxiv.org
Machine translation is a natural candidate problem for reinforcement learning from human
feedback: users provide quick, dirty ratings on candidate translations to guide a system to …

Communicative feedback in language acquisition

M Nikolaus, A Fourtassi - New Ideas in Psychology, 2023 - Elsevier
Children start to communicate and use language in social interactions from a very young
age. This allows them to experiment with their developing linguistic knowledge and receive …

Reliability and learnability of human bandit feedback for sequence-to-sequence reinforcement learning

J Kreutzer, J Uyheng, S Riezler - arXiv preprint arXiv:1805.10627, 2018 - arxiv.org
We present a study on reinforcement learning (RL) from human bandit feedback for
sequence-to-sequence learning, exemplified by the task of bandit neural machine …

Improving machine translation with human feedback: An exploration of quality estimation as a reward model

Z He, X Wang, W Jiao, Z Zhang, R Wang, S Shi… - arXiv preprint arXiv …, 2024 - arxiv.org
Insufficient modeling of human preferences within the reward model is a major obstacle for
leveraging human feedback to improve translation quality. Fortunately, quality estimation …

APRIL: Interactively learning to summarise by combining active preference learning and reinforcement learning

Y Gao, CM Meyer, I Gurevych - arXiv preprint arXiv:1808.09658, 2018 - arxiv.org
We propose a method to perform automatic document summarisation without using
reference summaries. Instead, our method interactively learns from users' preferences. The …

An imitation game for learning semantic parsers from user interaction

Z Yao, Y Tang, W Yih, H Sun, Y Su - arXiv preprint arXiv:2005.00689, 2020 - arxiv.org
Despite widely successful applications, bootstrapping and fine-tuning semantic parsers
remain a tedious process with challenges such as costly data annotation and privacy risks …

Simulating bandit learning from user feedback for extractive question answering

G Gao, E Choi, Y Artzi - arXiv preprint arXiv:2203.10079, 2022 - arxiv.org
We study learning from user feedback for extractive question answering by simulating
feedback using supervised data. We cast the problem as contextual bandit learning, and …

Bandit structured prediction for neural sequence-to-sequence learning

J Kreutzer, A Sokolov, S Riezler - arXiv preprint arXiv:1704.06497, 2017 - arxiv.org
Bandit structured prediction describes a stochastic optimization framework where learning is
performed from partial feedback. This feedback is received in the form of a task loss …