Self-generated critiques boost reward modeling for language models

Y Yu, Z Chen, A Zhang, L Tan, C Zhu, RY Pang… - arXiv preprint arXiv …, 2024 - arxiv.org
Reward modeling is crucial for aligning large language models (LLMs) with human
preferences, especially in reinforcement learning from human feedback (RLHF). However …
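For context on the snippet above: reward models for RLHF are most commonly trained with a Bradley-Terry pairwise preference objective. The sketch below is a minimal, generic illustration of that standard loss (assuming PyTorch; the function name `pairwise_reward_loss` is hypothetical), not the specific critique-based method proposed in this paper.

```python
# Generic Bradley-Terry pairwise loss for reward-model training (illustrative only).
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that the chosen response outranks the rejected one.

    r_chosen, r_rejected: scalar reward-model scores for each preference pair, shape (batch,).
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

if __name__ == "__main__":
    # Dummy scores standing in for reward-model outputs on preference pairs.
    chosen = torch.tensor([1.2, 0.3, 2.0])
    rejected = torch.tensor([0.5, 0.1, -1.0])
    print(pairwise_reward_loss(chosen, rejected))
```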

Self-Consistency Preference Optimization

A Prasad, W Yuan, RY Pang, J Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-alignment, whereby models learn to improve themselves without human annotation, is a
rapidly growing research area. However, existing techniques often fail to improve complex …

Natural language reinforcement learning

X Feng, Z Wan, H Fu, B Liu, M Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning (RL) mathematically formulates decision-making as a Markov
Decision Process (MDP). With MDPs, researchers have achieved remarkable breakthroughs …
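As a reminder of the formalism the snippet refers to, an MDP is the tuple (S, A, P, R, γ). The toy sketch below (assuming NumPy, with a randomly generated transition model) runs value iteration on such a tuple; it is purely illustrative of the MDP abstraction, not the natural-language RL approach of this paper.

```python
# Toy MDP (S, A, P, R, gamma) solved by value iteration (illustrative only).
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

# P[s, a, s'] = transition probability; R[s, a] = expected immediate reward.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

V = np.zeros(n_states)
for _ in range(500):
    # Bellman optimality update: V(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    Q = R + gamma * P @ V
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print("Optimal state values:", V)
```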

Technical Report: Enhancing LLM Reasoning with Reward-guided Tree Search

J Jiang, Z Chen, Y Min, J Chen, X Cheng… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, test-time scaling has garnered significant attention from the research community,
largely due to the substantial advancements of the o1 model released by OpenAI. By …

Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought

V Xiang, C Snell, K Gandhi, A Albalak, A Singh… - arXiv preprint arXiv …, 2025 - arxiv.org
We propose a novel framework, Meta Chain-of-Thought (Meta-CoT), which extends
traditional Chain-of-Thought (CoT) by explicitly modeling the underlying reasoning required …