YF Zhang, W Yu, Q Wen, X Wang, Z Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
In the realms of computer vision and natural language processing, Large Vision-Language Models (LVLMs) have become indispensable tools, proficient in generating textual …
J Zhao, Z Yu, X Zhang, Y Yang - IEEE Access, 2023 - ieeexplore.ieee.org
Recent research has revealed the notorious language prior problem in visual question answering (VQA) tasks based on visual-textual interaction, which indicates that well …
Motivated by the in-context learning (ICL) capabilities of Large Language Models (LLMs), multimodal LLMs with an additional visual modality have also been shown to exhibit similar ICL abilities …
W An, F Tian, S Leng, J Nie, H Lin, QY Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite their great success across various multimodal tasks, Large Vision-Language Models (LVLMs) face a prevalent problem with object hallucinations, where the …
Z Yu, J Zhao, C Guo, Y Yang - IET Computer Vision, 2024 - Wiley Online Library
With the boom in computer vision and natural language processing, cross-modal intersections such as visual question answering (VQA) have become very popular …
J Zhu, Y Liu, H Zhu, H Lin, Y Jiang, Z Zhang… - ACM Multimedia … - openreview.net
The challenge of bias in visual question answering (VQA) has gained considerable attention in contemporary research. Various intricate bias dependencies, such as modalities and data …