Data-Centric Human Preference Optimization with Rationales

HA Just, M Jin, A Sahu, H Phan, R Jia - arXiv preprint arXiv:2407.14477, 2024 - arxiv.org
Reinforcement learning from human feedback plays a crucial role in aligning language
models with human preferences, traditionally represented through comparisons …