This paper studies a theoretical framework for aligning generative models with Reinforcement Learning from Human Feedback (RLHF). We consider a standard …
We propose controlled decoding (CD), a novel off-policy reinforcement learning method for steering autoregressive generation from language models toward high reward …
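The excerpt names the method but not its mechanics, so here is a minimal sketch of value-guided token reweighting, one common way prefix-scorer-based decoding is realized; the prefix scorer `value_fn`, the exponential tilting rule, and the temperature `beta` are assumptions here, not necessarily the paper's exact construction.

```python
import math
from typing import Callable, Dict

def reweight_next_token(
    base_probs: Dict[str, float],      # p(token | prefix) from the base model
    value_fn: Callable[[str], float],  # learned prefix scorer: predicted future reward
    prefix: str,
    beta: float = 1.0,
) -> Dict[str, float]:
    """Tilt the base model's next-token distribution toward tokens whose
    continuations the prefix scorer predicts will earn high reward:
    p'(t) proportional to p(t) * exp(value_fn(prefix + t) / beta)."""
    weights = {
        tok: p * math.exp(value_fn(prefix + tok) / beta)
        for tok, p in base_probs.items()
    }
    z = sum(weights.values())
    return {tok: w / z for tok, w in weights.items()}
```

As `beta` grows, the tilt vanishes and decoding falls back to the base model; small `beta` trusts the prefix scorer more aggressively.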
In this technical report, we present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF), which is widely reported to outperform its offline counterpart …
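To make the workflow concrete, here is a skeleton of one common online iterative loop (an assumed generic recipe, not necessarily the report's exact one): in each round, sample fresh responses from the current policy, label best-vs-worst preference pairs with a reward model, and update the policy on the new data. `sample`, `reward`, and `update` are placeholders.

```python
from typing import Callable, List, Tuple

def online_iterative_rlhf(
    sample: Callable[[str], str],    # current policy's sampler
    reward: Callable[[str], float],  # (proxy) reward model
    update: Callable[[List[Tuple[str, str, str]]], Callable[[str], str]],
    prompts: List[str],
    iterations: int = 3,
    k: int = 8,
) -> Callable[[str], str]:
    """Each round samples k responses per prompt from the *current* policy,
    forms (prompt, chosen, rejected) pairs from the best and worst scored
    responses, then updates the policy on the freshly collected pairs."""
    for _ in range(iterations):
        pairs: List[Tuple[str, str, str]] = []
        for x in prompts:
            responses = [sample(x) for _ in range(k)]
            ranked = sorted(responses, key=reward, reverse=True)
            pairs.append((x, ranked[0], ranked[-1]))
        sample = update(pairs)  # e.g., a DPO-style update returning a new sampler
    return sample
```

The key difference from offline RLHF is that preference data is regenerated from the updated policy each round rather than fixed in advance.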
Diffusion models excel at capturing the natural design spaces of images, molecules, DNA, RNA, and protein sequences. However, rather than merely generating designs that are …
Best-of-N (BoN) is a popular and effective algorithm for aligning language models to human preferences. The algorithm works as follows: at inference time, N samples are drawn from …
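Since the procedure is spelled out in the excerpt, a minimal sketch may help; `sample` and `reward` are stand-ins (assumptions) for the base model's sampler and the reward model.

```python
import random
from typing import Callable

def best_of_n(sample: Callable[[], str], reward: Callable[[str], float], n: int) -> str:
    """Best-of-N: draw n candidates from the base model and return the one
    the reward model scores highest."""
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=reward)

# Toy usage: the "model" samples canned replies, the "reward" prefers longer text.
replies = ["short reply", "a medium-length reply", "a much longer, more detailed reply"]
print(best_of_n(lambda: random.choice(replies), lambda t: float(len(t)), n=8))
```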
Inference-time alignment enhances the performance of large language models without requiring additional training or fine-tuning, but presents challenges in balancing …
Large language models are typically fine-tuned to align with human preferences, but tuning large models is computationally intensive and complex. In this work, we introduce $\textit …
Generative modeling of discrete data underlies important applications ranging from text-based agents like ChatGPT to the design of the very building blocks of life in protein sequences …
Let $p$ denote a generative language model. Let $r$ denote a reward model that returns a scalar capturing the degree to which a draw from $p$ is preferred. The goal of …
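A common way to complete this setup (an assumption here, but standard across the KL-regularized RLHF literature) is to seek an aligned distribution $\pi$ that trades expected reward against divergence from $p$:

$$
\pi^\star \;=\; \arg\max_{\pi}\; \mathbb{E}_{y \sim \pi}\bigl[r(y)\bigr] \;-\; \beta\,\mathrm{KL}\bigl(\pi \,\|\, p\bigr),
$$

whose well-known closed form is $\pi^\star(y) \propto p(y)\, e^{r(y)/\beta}$, with $\beta$ controlling how far the aligned model may drift from the base model.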