Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant …
Abstract: We introduce AutoVER, an Autoregressive model for Visual Entity Recognition. Our model extends an autoregressive Multimodal Large Language Model by employing retrieval …
Despite the astonishing performance of recent Large Vision-Language Models (LVLMs), these models often generate inaccurate responses. To address this issue, previous studies …
H Wang, W Ge - European Conference on Computer Vision, 2025 - Springer
With the breakthrough of multi-modal large language models (MLLMs), answering complex visual questions that demand advanced reasoning abilities and world knowledge has …
CLIP is a powerful and widely used tool for understanding images in the context of natural language descriptions to perform nuanced tasks. However, it does not offer application …
D Hao, Q Wang, L Guo, J Jiang… - Proceedings of the 2024 …, 2024 - aclanthology.org
While large pre-trained visual-language models have shown promising results on traditional visual question answering benchmarks, it is still challenging for them to answer complex …
P Jian, D Yu, J Zhang - Proceedings of the 2024 Conference on …, 2024 - aclanthology.org
Visual question answering (VQA) tasks, often performed by visual language models (VLMs), face challenges with long-tail knowledge. Recent retrieval-augmented VQA (RA-VQA) …
In recent times, Vision-Language Models (VLMs) have been trained under two predominant paradigms. Generative training has enabled Multimodal Large Language Models (MLLMs) …