Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint

W Xiong, H Dong, C Ye, Z Wang, H Zhong… - … on Machine Learning, 2024 - openreview.net
This paper studies the theoretical framework of the alignment process of generative models
with Reinforcement Learning from Human Feedback (RLHF). We consider a standard …

DPO meets PPO: Reinforced token optimization for RLHF

H Zhong, G Feng, W Xiong, X Cheng, L Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal
Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards--a …

Maximize to explore: One objective function fusing estimation, planning, and exploration

Z Liu, M Lu, W Xiong, H Zhong, H Hu… - Advances in …, 2024 - proceedings.neurips.cc
In reinforcement learning (RL), balancing exploration and exploitation is crucial for
achieving an optimal policy in a sample-efficient way. To this end, existing sample-efficient …

Self-exploring language models: Active preference elicitation for online alignment

S Zhang, D Yu, H Sharma, H Zhong, Z Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Preference optimization, particularly through Reinforcement Learning from Human
Feedback (RLHF), has achieved significant success in aligning Large Language Models …

Building math agents with multi-turn iterative preference learning

W Xiong, C Shi, J Shen, A Rosenberg, Z Qin… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent studies have shown that large language models' (LLMs) mathematical problem-
solving capabilities can be enhanced by integrating external tools, such as code …

When is agnostic reinforcement learning statistically tractable?

Z Jia, G Li, A Rakhlin, A Sekhari… - Advances in Neural …, 2024 - proceedings.neurips.cc
We study the problem of agnostic PAC reinforcement learning (RL): given a policy class $\Pi$,
$, how many rounds of interaction with an unknown MDP (with a potentially large state and …

Making RL with preference-based feedback efficient via randomization

R Wu, W Sun - arXiv preprint arXiv:2310.14554, 2023 - arxiv.org
Reinforcement Learning algorithms that learn from human feedback (RLHF) need to be
efficient in terms of statistical complexity, computational complexity, and query complexity. In …

RLHF workflow: From reward modeling to online RLHF

H Dong, W Xiong, B Pang, H Wang, H Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In this technical report, we present the workflow of Online Iterative Reinforcement Learning from Human
Feedback (RLHF), which is widely reported to outperform its offline counterpart …

Posterior sampling for competitive RL: Function approximation and partial observation

S Qiu, Z Dai, H Zhong, Z Wang… - Advances in Neural …, 2024 - proceedings.neurips.cc
This paper investigates posterior sampling algorithms for competitive reinforcement learning
(RL) in the context of general function approximations. Focusing on zero-sum Markov games …

Reason for future, act for now: A principled architecture for autonomous LLM agents

Z Liu, H Hu, S Zhang, H Guo, S Ke, B Liu… - Forty-first International …, 2023 - openreview.net
Large language models (LLMs) demonstrate impressive reasoning abilities, but translating
reasoning into actions in the real world remains challenging. In particular, it is unclear how …