W Yu, Z Yang, S Lin, Q Zhao, J Wang, L Gui… - arXiv preprint arXiv …, 2024 - arxiv.org
In text-to-image (T2I) generation, a prevalent training technique is to use Vision Language Models (VLMs) for image re-captioning. Even though VLMs are known to exhibit …