Cross-modal retrieval: a systematic review of methods and future directions

T Wang, F Li, L Zhu, J Li, Z Zhang, HT Shen - arXiv preprint arXiv …, 2023 - arxiv.org
With the exponential surge in diverse multi-modal data, traditional uni-modal retrieval
methods struggle to meet the needs of users seeking access to data across various …

mPLUG-Owl3: Towards long image-sequence understanding in multi-modal large language models

J Ye, H Xu, H Liu, A Hu, M Yan, Q Qian, J Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities
in executing instructions for a variety of single-image tasks. Despite this progress, significant …

Grounding language models for visual entity recognition

Z Xiao, M Gong, P Cascante-Bonilla, X Zhang… - … on Computer Vision, 2025 - Springer
We introduce AutoVER, an Autoregressive model for Visual Entity Recognition. Our
model extends an autoregressive Multimodal Large Language Model by employing retrieval …

EchoSight: Advancing Visual-Language Models with Wiki Knowledge

Y Yan, W Xie - arXiv preprint arXiv:2407.12735, 2024 - arxiv.org
Knowledge-based Visual Question Answering (KVQA) tasks require answering questions
about images using extensive background knowledge. Despite significant advancements …

VACoDe: Visual augmented contrastive decoding

S Kim, B Cho, S Bae, S Ahn, SY Yun - arXiv preprint arXiv:2408.05337, 2024 - arxiv.org
Despite the astonishing performance of recent Large Vision-Language Models (LVLMs),
these models often generate inaccurate responses. To address this issue, previous studies …

Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge

H Wang, W Ge - European Conference on Computer Vision, 2025 - Springer
With the breakthrough of multi-modal large language models (MLLMs), answering complex
visual questions that demand advanced reasoning abilities and world knowledge has …

Taming CLIP for Fine-Grained and Structured Visual Understanding of Museum Exhibits

AA Balauca, DP Paudel, K Toutanova… - European Conference on …, 2025 - Springer
CLIP is a powerful and widely used tool for understanding images in the context of natural
language descriptions to perform nuanced tasks. However, it does not offer application …

Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering

D Hao, Q Wang, L Guo, J Jiang… - Proceedings of the 2024 …, 2024 - aclanthology.org
While large pre-trained visual-language models have shown promising results on traditional
visual question answering benchmarks, it is still challenging for them to answer complex …

Large language models know what is key visual entity: An LLM-assisted multimodal retrieval for VQA

P Jian, D Yu, J Zhang - Proceedings of the 2024 Conference on …, 2024 - aclanthology.org
Visual question answering (VQA) tasks, often performed by visual language models (VLMs),
face challenges with long-tail knowledge. Recent retrieval-augmented VQA (RA-VQA) …

Unified Generative and Discriminative Training for Multi-modal Large Language Models

W Chow, J Li, Q Yu, K Pan, H Fei, Z Ge, S Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent times, Vision-Language Models (VLMs) have been trained under two predominant
paradigms. Generative training has enabled Multimodal Large Language Models (MLLMs) …