J Wang, Y Zhou, X Zhang, M Bao, P Yan - arXiv preprint arXiv:2409.11212, 2024 - arxiv.org
Iterative preference optimization has recently become one of the de facto training paradigms
for large language models (LLMs), but performance is still underwhelming due to too …