Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint

W Xiong, H Dong, C Ye, Z Wang, H Zhong… - … on Machine Learning, 2024 - openreview.net
This paper studies the theoretical framework of the alignment process of generative models
with Reinforcement Learning from Human Feedback (RLHF). We consider a standard …

DPO meets PPO: Reinforced token optimization for RLHF

H Zhong, G Feng, W Xiong, X Cheng, L Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal
Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards--a …

Maximize to explore: One objective function fusing estimation, planning, and exploration

Z Liu, M Lu, W Xiong, H Zhong, H Hu… - Advances in …, 2024 - proceedings.neurips.cc
In reinforcement learning (RL), balancing exploration and exploitation is crucial for
achieving an optimal policy in a sample-efficient way. To this end, existing sample-efficient …

Self-exploring language models: Active preference elicitation for online alignment

S Zhang, D Yu, H Sharma, H Zhong, Z Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Preference optimization, particularly through Reinforcement Learning from Human
Feedback (RLHF), has achieved significant success in aligning Large Language Models …

Building math agents with multi-turn iterative preference learning

W Xiong, C Shi, J Shen, A Rosenberg, Z Qin… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent studies have shown that large language models' (LLMs) mathematical problem-
solving capabilities can be enhanced by integrating external tools, such as code …

When is agnostic reinforcement learning statistically tractable?

Z Jia, G Li, A Rakhlin, A Sekhari… - Advances in Neural …, 2024 - proceedings.neurips.cc
We study the problem of agnostic PAC reinforcement learning (RL): given a policy class $\Pi$,
$, how many rounds of interaction with an unknown MDP (with a potentially large state and …

Making RL with preference-based feedback efficient via randomization

R Wu, W Sun - arXiv preprint arXiv:2310.14554, 2023 - arxiv.org
Reinforcement Learning algorithms that learn from human feedback (RLHF) need to be
efficient in terms of statistical complexity, computational complexity, and query complexity. In …

RLHF workflow: From reward modeling to online RLHF

H Dong, W Xiong, B Pang, H Wang, H Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In this technical report, we present the workflow of Online Iterative Reinforcement Learning from Human
Feedback (RLHF), which is widely reported to outperform its offline counterpart …

Posterior sampling for competitive RL: Function approximation and partial observation

S Qiu, Z Dai, H Zhong, Z Wang… - Advances in Neural …, 2024 - proceedings.neurips.cc
This paper investigates posterior sampling algorithms for competitive reinforcement learning
(RL) in the context of general function approximations. Focusing on zero-sum Markov games …

Reason for future, act for now: A principled architecture for autonomous LLM agents

Z Liu, H Hu, S Zhang, H Guo, S Ke, B Liu… - Forty-first International …, 2023 - openreview.net
Large language models (LLMs) demonstrate impressive reasoning abilities, but translating
reasoning into actions in the real world remains challenging. In particular, it is unclear how …