In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards -- a …
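For reference, the PPO objective typically used in this setting is the clipped surrogate loss, with the sentence-level reward credited only at the final token of the response (a standard formulation given for context; it may not match the exact variant used in this work):

$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\big[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$

where $\hat{A}_t$ is an advantage estimate and $\epsilon$ is the clipping range.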
In reinforcement learning (RL), balancing exploration and exploitation is crucial for achieving an optimal policy in a sample-efficient way. To this end, existing sample-efficient …
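As one concrete illustration of this trade-off (a textbook UCB-style rule, not necessarily the method proposed in this work), an agent can add an exploration bonus to its value estimates:

$a_t = \arg\max_{a}\Big[\hat{Q}_t(a) + c\,\sqrt{\ln t / N_t(a)}\Big]$

where $N_t(a)$ counts how often action $a$ has been tried and $c$ scales the bonus toward under-explored actions.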
Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning Large Language Models …
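The underlying objective in RLHF-style preference optimization is commonly the KL-regularized reward maximization problem (standard background stated for context, not a claim about this particular paper's formulation):

$\max_{\pi}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r(x, y)\big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi(\cdot \mid x)\ \|\ \pi_{\mathrm{ref}}(\cdot \mid x)\big]$

where $r$ is a reward model fit to human preferences, $\pi_{\mathrm{ref}}$ is the supervised reference policy, and $\beta$ controls how far the aligned model may drift from it.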
Recent studies have shown that large language models' (LLMs) mathematical problem-solving capabilities can be enhanced by integrating external tools, such as code …
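A minimal sketch of what such tool integration can look like, assuming a hypothetical generate(prompt, stop) text-generation function and Python code fenced inside the model output; this illustrates the general pattern rather than the specific system in this work:

# Minimal tool-integrated reasoning loop: the model interleaves text and code,
# and executed code results are appended back into the prompt.
import contextlib
import io
import re

def run_code(code: str) -> str:
    """Execute a generated snippet and capture its stdout (sandbox this in practice)."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
    except Exception as exc:
        return f"Error: {exc}"
    return buf.getvalue().strip()

def solve(question: str, generate, max_rounds: int = 4) -> str:
    """`generate` is a hypothetical LLM call: (prompt, stop) -> continuation text."""
    prompt = question
    for _ in range(max_rounds):
        text = generate(prompt, stop="```output")
        prompt += text
        match = re.search(r"```python\n(.*?)```", text, re.S)
        if match is None:  # no code emitted -> treat the text as the final answer
            return text
        result = run_code(match.group(1))
        prompt += f"```output\n{result}\n```\n"
    return prompt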
We study the problem of agnostic PAC reinforcement learning (RL): given a policy class $\Pi$, how many rounds of interaction with an unknown MDP (with a potentially large state and …
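In this agnostic setting, the goal is usually stated as returning a policy $\hat{\pi}$ that competes with the best policy in the class, i.e., with probability at least $1-\delta$,

$V^{\hat{\pi}} \geq \max_{\pi \in \Pi} V^{\pi} - \epsilon,$

using as few episodes of interaction as possible (the standard agnostic PAC criterion; the paper may parameterize it differently).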
R Wu, W Sun - arXiv preprint arXiv:2310.14554, 2023 - arxiv.org
Reinforcement Learning algorithms that learn from human feedback (RLHF) need to be efficient in terms of statistical complexity, computational complexity, and query complexity. In …
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart …
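A rough sketch of the iterative loop such a workflow describes, with all helpers (sample_prompts, preference_label, dpo_update) passed in as hypothetical callables rather than taken from the report:

def online_iterative_rlhf(policy, reward_model, sample_prompts,
                          preference_label, dpo_update,
                          num_iters: int = 3, batch_size: int = 512):
    """Online iterative RLHF sketch: regenerate preference data from the
    current policy each round, then run an offline-style preference update."""
    for _ in range(num_iters):
        pairs = []
        for x in sample_prompts(batch_size):
            y1, y2 = policy.generate(x), policy.generate(x)  # two on-policy samples
            chosen, rejected = preference_label(reward_model, x, y1, y2)
            pairs.append((x, chosen, rejected))
        policy = dpo_update(policy, pairs)  # e.g., DPO on the fresh pairs
    return policy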
S Qiu, Z Dai, H Zhong, Z Wang… - Advances in Neural …, 2024 - proceedings.neurips.cc
This paper investigates posterior sampling algorithms for competitive reinforcement learning (RL) in the context of general function approximations. Focusing on zero-sum Markov games …
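For orientation, the solution concept in a zero-sum Markov game (written here in the discounted case for concreteness; this is standard background, not the paper's own statement) is the minimax value satisfying the Bellman-style fixed point

$V^{*}(s) = \max_{\mu \in \Delta(\mathcal{A})} \min_{\nu \in \Delta(\mathcal{B})} \mathbb{E}_{a \sim \mu,\, b \sim \nu}\Big[r(s, a, b) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a, b)}\big[V^{*}(s')\big]\Big]$

with the max-player mixing over actions in $\mathcal{A}$ and the min-player over $\mathcal{B}$.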
Large language models (LLMs) demonstrate impressive reasoning abilities, but translating reasoning into actions in the real world remains challenging. In particular, it is unclear how …