Y Qi, H Li, Y Song, X Wu, J Luo - arXiv preprint arXiv:2412.08158, 2024 - arxiv.org
The exploration of various vision-language tasks, such as visual captioning, visual question
answering, and visual commonsense reasoning, is an important area in artificial intelligence …