Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint

W Xiong, H Dong, C Ye, Z Wang, H Zhong… - … on Machine Learning, 2024 - openreview.net
This paper studies the theoretical framework of the alignment process of generative models
with Reinforcement Learning from Human Feedback (RLHF). We consider a standard …
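
A minimal sketch of the standard KL-regularized objective this line of work studies (the regularization weight $\eta$, reference policy $\pi_0$, and prompt distribution $d_0$ are assumed notation, not taken from the snippet):

$$\max_{\pi}\;\mathbb{E}_{x\sim d_0,\,y\sim\pi(\cdot\mid x)}\big[r(x,y)\big]\;-\;\eta\,\mathbb{E}_{x\sim d_0}\big[\mathrm{KL}\big(\pi(\cdot\mid x)\,\big\|\,\pi_0(\cdot\mid x)\big)\big]$$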

Maximize to explore: One objective function fusing estimation, planning, and exploration

Z Liu, M Lu, W Xiong, H Zhong, H Hu… - Advances in …, 2024 - proceedings.neurips.cc
In reinforcement learning (RL), balancing exploration and exploitation is crucial for
achieving an optimal policy in a sample-efficient way. To this end, existing sample-efficient …
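
Roughly, the single fused objective selects, at each round $t$, the hypothesis whose planned value is highest after subtracting a scaled estimation loss on the data gathered so far; a sketch under assumed notation ($\mathcal{H}$ the hypothesis class, $L_{t-1}$ the empirical estimation loss, $\eta>0$ a trade-off weight):

$$f_t \in \operatorname*{arg\,max}_{f\in\mathcal{H}} \Big\{ V_f(s_1;\pi_f) - \eta\, L_{t-1}(f) \Big\},$$

after which the learner executes $\pi_{f_t}$. A hypothesis can only promise a high value if it also fits past data, which drives directed exploration without an explicit bonus term.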

Breaking the curse of multiagency: Provably efficient decentralized multi-agent rl with function approximation

Y Wang, Q Liu, Y Bai, C Jin - The Thirty Sixth Annual …, 2023 - proceedings.mlr.press
A unique challenge in Multi-Agent Reinforcement Learning (MARL) is the "curse of
multiagency", where the description length of the game as well as the complexity of many …
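
For concreteness: with $n$ agents each choosing among $A$ actions, the joint action space has $A^n$ elements, so $10$ agents with $10$ actions each already give $10^{10}$ joint actions; any method whose sample or computational cost scales with the joint action space is therefore intractable, and decentralized algorithms instead target complexity polynomial in $n$ and $A$.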

Gibbs sampling from human feedback: A provable KL-constrained framework for RLHF

W Xiong, H Dong, C Ye, H Zhong, N Jiang… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper studies the theoretical framework of the alignment process of generative models
with Reinforcement Learning from Human Feedback (RLHF). We consider a standard …
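
The "Gibbs" in the title refers to the closed-form maximizer of the KL-regularized objective sketched above: the optimal policy is a Gibbs distribution that tilts the reference policy by the (scaled) reward, under the same assumed notation,

$$\pi^{*}(y\mid x)\;\propto\;\pi_0(y\mid x)\,\exp\big(r(x,y)/\eta\big).$$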

Nearly minimax optimal offline reinforcement learning with linear function approximation: Single-agent MDP and Markov game

W Xiong, H Zhong, C Shi, C Shen, L Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
Offline reinforcement learning (RL) aims at learning an optimal strategy using a pre-collected
dataset without further interactions with the environment. While various algorithms …
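
A generic sketch of the pessimism principle used in this linear setting (the paper's construction is a more refined, variance-weighted variant): with features $\phi$, ridge estimate $\hat{w}$, covariance $\Lambda=\sum_i \phi(s_i,a_i)\phi(s_i,a_i)^{\top}+\lambda I$ over the offline data, and confidence width $\beta$, act greedily with respect to the lower confidence bound

$$\hat{Q}(s,a)\;=\;\phi(s,a)^{\top}\hat{w}\;-\;\beta\sqrt{\phi(s,a)^{\top}\Lambda^{-1}\phi(s,a)},$$

which penalizes state-action pairs poorly covered by the dataset.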

A theoretical analysis of Nash learning from human feedback under general KL-regularized preference

C Ye, W Xiong, Y Zhang, N Jiang, T Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) learns from the preference signal
provided by a probabilistic preference model, which takes a prompt and two responses as …
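
The learning target here is the Nash equilibrium of a KL-regularized two-player game rather than the maximizer of a scalar reward; a sketch under assumed notation ($\mathcal{P}$ the preference model, $\pi_0$ the reference policy, $\eta>0$):

$$\max_{\pi}\,\min_{\pi'}\;\mathbb{E}_{x,\,y\sim\pi,\,y'\sim\pi'}\big[\mathcal{P}(y\succ y'\mid x)\big]\;-\;\eta\,\mathrm{KL}(\pi\,\|\,\pi_0)\;+\;\eta\,\mathrm{KL}(\pi'\,\|\,\pi_0).$$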

GEC: A unified framework for interactive decision making in MDP, POMDP, and beyond

H Zhong, W Xiong, S Zheng, L Wang, Z Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
We study sample efficient reinforcement learning (RL) under the general framework of
interactive decision making, which includes Markov decision process (MDP), partially …

On sample-efficient offline reinforcement learning: Data diversity, posterior sampling and beyond

T Nguyen-Tang, R Arora - Advances in neural information …, 2024 - proceedings.neurips.cc
We seek to understand what facilitates sample-efficient learning from historical datasets for
sequential decision-making, a problem that is popularly known as offline reinforcement …

Posterior sampling for competitive RL: Function approximation and partial observation

S Qiu, Z Dai, H Zhong, Z Wang… - Advances in Neural …, 2024 - proceedings.neurips.cc
This paper investigates posterior sampling algorithms for competitive reinforcement learning
(RL) in the context of general function approximations. Focusing on zero-sum Markov games …
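
The generic shape of posterior sampling in this competitive setting (a sketch, not the paper's exact algorithm): at episode $t$, draw a hypothesis from the posterior over the function class given the history $\mathcal{D}_{t-1}$, compute an equilibrium policy pair for the sampled game, and play it to collect new data:

$$f_t\sim p(\cdot\mid\mathcal{D}_{t-1}),\qquad (\pi_t,\nu_t)=\mathrm{NE}(f_t).$$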

One objective to rule them all: A maximization objective fusing estimation and planning for exploration

Z Liu, M Lu, W Xiong, H Zhong, H Hu… - arXiv preprint arXiv …, 2023 - miaolu3.github.io
In online reinforcement learning (online RL), balancing exploration and exploitation is
crucial for finding an optimal policy in a sample-efficient way. To achieve this, existing …