Self-generated critiques boost reward modeling for language models

Y Yu, Z Chen, A Zhang, L Tan, C Zhu, RY Pang… - arXiv preprint arXiv …, 2024 - arxiv.org
Reward modeling is crucial for aligning large language models (LLMs) with human
preferences, especially in reinforcement learning from human feedback (RLHF). However …
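For context on the snippet above: reward models for RLHF are most commonly trained with a Bradley-Terry pairwise preference objective. The sketch below is a minimal, generic illustration of that standard loss (assuming PyTorch; the function name `pairwise_reward_loss` is hypothetical), not the specific critique-based method proposed in this paper.

```python
# Generic Bradley-Terry pairwise loss for reward-model training (illustrative only).
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that the chosen response outranks the rejected one.

    r_chosen, r_rejected: scalar reward-model scores for each preference pair, shape (batch,).
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

if __name__ == "__main__":
    # Dummy scores standing in for reward-model outputs on preference pairs.
    chosen = torch.tensor([1.2, 0.3, 2.0])
    rejected = torch.tensor([0.5, 0.1, -1.0])
    print(pairwise_reward_loss(chosen, rejected))
```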

Self-Consistency Preference Optimization

A Prasad, W Yuan, RY Pang, J Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-alignment, whereby models learn to improve themselves without human annotation, is a
rapidly growing research area. However, existing techniques often fail to improve complex …

Natural language reinforcement learning

X Feng, Z Wan, H Fu, B Liu, M Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning (RL) mathematically formulates decision-making as a Markov
Decision Process (MDP). With MDPs, researchers have achieved remarkable breakthroughs …
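As a reminder of the formalism the snippet refers to, an MDP is the tuple (S, A, P, R, γ). The toy sketch below (assuming NumPy, with a randomly generated transition model) runs value iteration on such a tuple; it is purely illustrative of the MDP abstraction, not the natural-language RL approach of this paper.

```python
# Toy MDP (S, A, P, R, gamma) solved by value iteration (illustrative only).
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

# P[s, a, s'] = transition probability; R[s, a] = expected immediate reward.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

V = np.zeros(n_states)
for _ in range(500):
    # Bellman optimality update: V(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    Q = R + gamma * P @ V
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print("Optimal state values:", V)
```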

Technical Report: Enhancing LLM Reasoning with Reward-guided Tree Search

J Jiang, Z Chen, Y Min, J Chen, X Cheng… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, test-time scaling has garnered significant attention from the research community,
largely due to the substantial advancements of the o1 model released by OpenAI. By …

Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought

V Xiang, C Snell, K Gandhi, A Albalak, A Singh… - arXiv preprint arXiv …, 2025 - arxiv.org
We propose a novel framework, Meta Chain-of-Thought (Meta-CoT), which extends
traditional Chain-of-Thought (CoT) by explicitly modeling the underlying reasoning required …