A survey on data selection for language models

A Albalak, Y Elazar, SM Xie, S Longpre… - arXiv preprint arXiv …, 2024 - arxiv.org
A major factor in the recent success of large language models is the use of enormous and
ever-growing text datasets for unsupervised pre-training. However, naively training a model …

Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint

W Xiong, H Dong, C Ye, Z Wang, H Zhong… - … on Machine Learning, 2024 - openreview.net
This paper studies the theoretical framework of the alignment process of generative models
with Reinforcement Learning from Human Feedback (RLHF). We consider a standard …
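
For reference, the standard KL-constrained alignment objective studied in this line of work is usually written as the regularized reward-maximization problem below; the prompt distribution $\mathcal{D}$, regularization weight $\beta$, and reference policy $\pi_{\mathrm{ref}}$ are conventional notation, not symbols quoted from this abstract.

$$\max_{\pi}\;\mathbb{E}_{x\sim\mathcal{D}}\Big[\,\mathbb{E}_{y\sim\pi(\cdot\mid x)}\big[r(x,y)\big]\;-\;\beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)\Big]$$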

Controlled decoding from language models

S Mudgal, J Lee, H Ganapathy, YG Li, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
We propose controlled decoding (CD), a novel off-policy reinforcement learning method to
control the autoregressive generation from language models towards high reward …
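
As a rough illustration of the general idea (not code from the paper), the sketch below samples the next token by combining a frozen base model's log-probabilities with a prefix value estimate; value_guided_step, value_fn, and the toy vocabulary are hypothetical stand-ins.

import math
import random

# Minimal sketch of value-guided token selection in the spirit of controlled
# decoding: the frozen base model's next-token log-probabilities are combined
# with a prefix value estimate, so generation is steered toward high-reward
# continuations without fine-tuning the base model.
def value_guided_step(logprobs, prefix, value_fn, beta=1.0):
    # Score each candidate token by base log-probability plus the (scaled)
    # value of the prefix extended with that token.
    scores = {tok: lp + beta * value_fn(prefix + [tok]) for tok, lp in logprobs.items()}
    # Renormalize into a proper distribution and sample one token.
    z = math.log(sum(math.exp(s) for s in scores.values()))
    probs = {tok: math.exp(s - z) for tok, s in scores.items()}
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

# Toy usage: a three-token vocabulary with a uniform base model and a value
# function that prefers continuations ending in "good".
base_logprobs = {"good": math.log(1 / 3), "ok": math.log(1 / 3), "bad": math.log(1 / 3)}
prefers_good = lambda prefix: 1.0 if prefix and prefix[-1] == "good" else 0.0
print(value_guided_step(base_logprobs, [], prefers_good, beta=2.0))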

RLHF workflow: From reward modeling to online RLHF

H Dong, W Xiong, B Pang, H Wang, H Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In this technical report, we present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF), which is widely reported to outperform its offline counterpart …

Derivative-free guidance in continuous and discrete diffusion models with soft value-based decoding

X Li, Y Zhao, C Wang, G Scalia, G Eraslan… - arXiv preprint arXiv …, 2024 - arxiv.org
Diffusion models excel at capturing the natural design spaces of images, molecules, DNA,
RNA, and protein sequences. However, rather than merely generating designs that are …

Variational best-of-N alignment

A Amini, T Vieira, R Cotterell - arXiv preprint arXiv:2407.06057, 2024 - arxiv.org
Best-of-N (BoN) is a popular and effective algorithm for aligning language models to human
preferences. The algorithm works as follows: at inference time, N samples are drawn from …
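
The procedure the snippet begins to describe can be stated in a few lines; the sketch below is a generic Best-of-N loop in Python, where sample_fn and reward_fn are hypothetical stand-ins for a language-model sampler and a reward model rather than an interface from the paper.

import random

# Minimal sketch of Best-of-N (BoN) sampling: draw N candidate completions
# from the base model, score each with the reward model, and return the
# highest-scoring candidate.
def best_of_n(sample_fn, reward_fn, n=16):
    candidates = [sample_fn() for _ in range(n)]
    return max(candidates, key=reward_fn)

# Toy usage: "completions" are random floats and the reward is the value
# itself, so BoN returns (approximately) the maximum of n uniform draws.
print(best_of_n(sample_fn=random.random, reward_fn=lambda y: y, n=16))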

TreeBoN: Enhancing inference-time alignment with speculative tree-search and best-of-N sampling

J Qiu, Y Lu, Y Zeng, J Guo, J Geng, H Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Inference-time alignment enhances the performance of large language models without
requiring additional training or fine-tuning but presents challenges due to balancing …

Inference-time language model alignment via integrated value guidance

Z Liu, Z Zhou, Y Wang, C Yang, Y Qiao - arXiv preprint arXiv:2409.17819, 2024 - arxiv.org
Large language models are typically fine-tuned to align with human preferences, but tuning
large models is computationally intensive and complex. In this work, we introduce …

Steering masked discrete diffusion models via discrete denoising posterior prediction

J Rector-Brooks, M Hasan, Z Peng, Z Quinn… - arXiv preprint arXiv …, 2024 - arxiv.org
Generative modeling of discrete data underlies important applications ranging from text-based
agents like ChatGPT to the design of the very building blocks of life in protein sequences …

Asymptotics of language model alignment

JQ Yang, S Salamatian, Z Sun, AT Suresh… - arXiv preprint arXiv …, 2024 - arxiv.org
Let $p$ denote a generative language model. Let $r$ denote a reward model that returns
a scalar that captures the degree to which a draw from $p$ is preferred. The goal of …
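
For context, the alignment objective in this setting is typically to raise expected reward under $r$ while staying close to $p$ in KL divergence; the optimal solution of that trade-off is the exponentially tilted distribution below (with $\beta$ the conventional regularization parameter, not notation quoted from the abstract), against which procedures such as Best-of-N can be compared.

$$\pi_{\beta}(y)\;=\;\frac{p(y)\,\exp\!\big(r(y)/\beta\big)}{\sum_{y'}p(y')\,\exp\!\big(r(y')/\beta\big)}$$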