Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint

W Xiong, H Dong, C Ye, Z Wang, H Zhong… - … on Machine Learning, 2024 - openreview.net
This paper studies the theoretical framework of the alignment process of generative models
with Reinforcement Learning from Human Feedback (RLHF). We consider a standard …
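
A minimal sketch of the standard KL-regularized objective this line of work studies (the regularization weight $\eta$, reference policy $\pi_0$, and prompt distribution $d_0$ are assumed notation, not taken from the snippet):

$$\max_{\pi}\;\mathbb{E}_{x\sim d_0,\,y\sim\pi(\cdot\mid x)}\big[r(x,y)\big]\;-\;\eta\,\mathbb{E}_{x\sim d_0}\big[\mathrm{KL}\big(\pi(\cdot\mid x)\,\big\|\,\pi_0(\cdot\mid x)\big)\big]$$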

Maximize to explore: One objective function fusing estimation, planning, and exploration

Z Liu, M Lu, W Xiong, H Zhong, H Hu… - Advances in …, 2024 - proceedings.neurips.cc
In reinforcement learning (RL), balancing exploration and exploitation is crucial for
achieving an optimal policy in a sample-efficient way. To this end, existing sample-efficient …
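
Roughly, the single fused objective selects, at each round $t$, the hypothesis whose planned value is highest after subtracting a scaled estimation loss on the data gathered so far; a sketch under assumed notation ($\mathcal{H}$ the hypothesis class, $L_{t-1}$ the empirical estimation loss, $\eta>0$ a trade-off weight):

$$f_t \in \operatorname*{arg\,max}_{f\in\mathcal{H}} \Big\{ V_f(s_1;\pi_f) - \eta\, L_{t-1}(f) \Big\},$$

after which the learner executes $\pi_{f_t}$. A hypothesis can only promise a high value if it also fits past data, which drives directed exploration without an explicit bonus term.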

Breaking the curse of multiagency: Provably efficient decentralized multi-agent rl with function approximation

Y Wang, Q Liu, Y Bai, C Jin - The Thirty Sixth Annual …, 2023 - proceedings.mlr.press
A unique challenge in Multi-Agent Reinforcement Learning (MARL) is the "curse of
multiagency", where the description length of the game as well as the complexity of many …
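
For concreteness: with $n$ agents each choosing among $A$ actions, the joint action space has $A^n$ elements, so $10$ agents with $10$ actions each already give $10^{10}$ joint actions; any method whose sample or computational cost scales with the joint action space is therefore intractable, and decentralized algorithms instead target complexity polynomial in $n$ and $A$.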

Gibbs sampling from human feedback: A provable KL-constrained framework for RLHF

W Xiong, H Dong, C Ye, H Zhong, N Jiang… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper studies the theoretical framework of the alignment process of generative models
with Reinforcement Learning from Human Feedback (RLHF). We consider a standard …
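
The "Gibbs" in the title refers to the closed-form maximizer of the KL-regularized objective sketched above: the optimal policy is a Gibbs distribution that tilts the reference policy by the (scaled) reward, under the same assumed notation,

$$\pi^{*}(y\mid x)\;\propto\;\pi_0(y\mid x)\,\exp\big(r(x,y)/\eta\big).$$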

Nearly minimax optimal offline reinforcement learning with linear function approximation: Single-agent MDP and Markov game

W Xiong, H Zhong, C Shi, C Shen, L Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
Offline reinforcement learning (RL) aims at learning an optimal strategy using a pre-collected
dataset without further interactions with the environment. While various algorithms …
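
A generic sketch of the pessimism principle used in this linear setting (the paper's construction is a more refined, variance-weighted variant): with features $\phi$, ridge estimate $\hat{w}$, covariance $\Lambda=\sum_i \phi(s_i,a_i)\phi(s_i,a_i)^{\top}+\lambda I$ over the offline data, and confidence width $\beta$, act greedily with respect to the lower confidence bound

$$\hat{Q}(s,a)\;=\;\phi(s,a)^{\top}\hat{w}\;-\;\beta\sqrt{\phi(s,a)^{\top}\Lambda^{-1}\phi(s,a)},$$

which penalizes state-action pairs poorly covered by the dataset.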

A theoretical analysis of Nash learning from human feedback under general KL-regularized preference

C Ye, W Xiong, Y Zhang, N Jiang, T Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) learns from the preference signal
provided by a probabilistic preference model, which takes a prompt and two responses as …
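
The learning target here is the Nash equilibrium of a KL-regularized two-player game rather than the maximizer of a scalar reward; a sketch under assumed notation ($\mathcal{P}$ the preference model, $\pi_0$ the reference policy, $\eta>0$):

$$\max_{\pi}\,\min_{\pi'}\;\mathbb{E}_{x,\,y\sim\pi,\,y'\sim\pi'}\big[\mathcal{P}(y\succ y'\mid x)\big]\;-\;\eta\,\mathrm{KL}(\pi\,\|\,\pi_0)\;+\;\eta\,\mathrm{KL}(\pi'\,\|\,\pi_0).$$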

GEC: A unified framework for interactive decision making in MDP, POMDP, and beyond

H Zhong, W Xiong, S Zheng, L Wang, Z Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
We study sample efficient reinforcement learning (RL) under the general framework of
interactive decision making, which includes Markov decision process (MDP), partially …

On sample-efficient offline reinforcement learning: Data diversity, posterior sampling and beyond

T Nguyen-Tang, R Arora - Advances in neural information …, 2024 - proceedings.neurips.cc
We seek to understand what facilitates sample-efficient learning from historical datasets for
sequential decision-making, a problem that is popularly known as offline reinforcement …

Posterior sampling for competitive RL: Function approximation and partial observation

S Qiu, Z Dai, H Zhong, Z Wang… - Advances in Neural …, 2024 - proceedings.neurips.cc
This paper investigates posterior sampling algorithms for competitive reinforcement learning
(RL) in the context of general function approximations. Focusing on zero-sum Markov games …
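
The generic shape of posterior sampling in this competitive setting (a sketch, not the paper's exact algorithm): at episode $t$, draw a hypothesis from the posterior over the function class given the history $\mathcal{D}_{t-1}$, compute an equilibrium policy pair for the sampled game, and play it to collect new data:

$$f_t\sim p(\cdot\mid\mathcal{D}_{t-1}),\qquad (\pi_t,\nu_t)=\mathrm{NE}(f_t).$$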

One objective to rule them all: A maximization objective fusing estimation and planning for exploration

Z Liu, M Lu, W Xiong, H Zhong, H Hu… - arXiv preprint arXiv …, 2023 - miaolu3.github.io
In online reinforcement learning (online RL), balancing exploration and exploitation is
crucial for finding an optimal policy in a sample-efficient way. To achieve this, existing …