Foundation Models Defining a New Era in Vision: A Survey and Outlook

M Awais, M Naseer, S Khan, RM Anwer… - IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025 - ieeexplore.ieee.org
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

Unified hallucination detection for multimodal large language models

X Chen, C Wang, Y Xue, N Zhang, X Yang, Q Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite significant strides in multimodal tasks, Multimodal Large Language Models (MLLMs)
are plagued by the critical issue of hallucination. The reliable detection of such …

A comprehensive survey of hallucination in large language, image, video and audio foundation models

P Sahoo, P Meharia, A Ghosh, S Saha… - Findings of the …, 2024 - aclanthology.org
The rapid advancement of foundation models (FMs) across language, image, audio, and
video domains has shown remarkable capabilities in diverse tasks. However, the …

Hallucination of multimodal large language models: A survey

Z Bai, P Wang, T Xiao, T He, Z Han, Z Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
This survey presents a comprehensive analysis of the phenomenon of hallucination in
multimodal large language models (MLLMs), also known as Large Vision-Language Models …

RLAIF-V: Aligning MLLMs through open-source AI feedback for super GPT-4V trustworthiness

T Yu, H Zhang, Y Yao, Y Dang, D Chen, X Lu… - arXiv preprint arXiv …, 2024 - arxiv.org
Learning from feedback reduces the hallucination of multimodal large language models
(MLLMs) by aligning them with human preferences. While traditional methods rely on labor …

UniMEL: A unified framework for multimodal entity linking with large language models

Q Liu, Y He, T Xu, D Lian, C Liu, Z Zheng… - Proceedings of the 33rd …, 2024 - dl.acm.org
Multimodal Entity Linking (MEL) is a crucial task that aims at linking ambiguous mentions
within multimodal contexts to the referent entities in a multimodal knowledge base, such as …

On Erroneous Agreements of CLIP Image Embeddings

S Li, PW Koh, SS Du - arXiv preprint arXiv:2411.05195, 2024 - arxiv.org
Recent research suggests that the failures of Vision-Language Models (VLMs) at visual
reasoning often stem from erroneous agreements--when semantically distinct images are …
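As a rough illustration of the phenomenon described in this snippet (not code from the paper): an "erroneous agreement" can be probed by embedding two semantically distinct images with an off-the-shelf CLIP model and checking whether their image embeddings are nearly identical under cosine similarity. The checkpoint, the file names, and the 0.9 threshold below are illustrative assumptions.

```python
# Illustrative sketch (not from the paper): probe a pair of semantically
# distinct images for an "erroneous agreement", i.e. near-identical CLIP
# image embeddings. Checkpoint, file names, and threshold are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical image pair that differs in spatial relation but may collide in embedding space.
images = [Image.open("cup_left_of_plate.png"), Image.open("cup_right_of_plate.png")]
inputs = processor(images=images, return_tensors="pt")

with torch.no_grad():
    feats = model.get_image_features(**inputs)    # one embedding per image
feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize for cosine similarity

cosine = (feats[0] @ feats[1]).item()
if cosine > 0.9:  # assumed threshold for flagging an "agreement"
    print(f"erroneous agreement candidate: cosine similarity = {cosine:.3f}")
```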

Game on Tree: Visual Hallucination Mitigation via Coarse-to-Fine View Tree and Game Theory

X Zhuang, Z Zhu, Z Chen, Y Xie, L Liang… - Proceedings of the …, 2024 - aclanthology.org
Large Vision-Language Models (LVLMs) may produce outputs that are unfaithful to
reality, also known as visual hallucinations (VH), which hinders their application in …

Mitigating Object Hallucination via Data Augmented Contrastive Tuning

P Sarkar, S Ebrahimi, A Etemad, A Beirami… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite their remarkable progress, Multimodal Large Language Models (MLLMs) tend to
hallucinate factually inaccurate information. In this work, we address object hallucinations in …

Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment

C Cui, A Zhang, Y Zhou, Z Chen, G Deng… - arXiv preprint arXiv …, 2024 - arxiv.org
The recent advancements in large language models (LLMs) and pre-trained vision models
have accelerated the development of vision-language large models (VLLMs), enhancing the …