From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

Large language models and causal inference in collaboration: A comprehensive survey

X Liu, P Xu, J Wu, J Yuan, Y Yang, Y Zhou, F Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Causal inference has shown potential in enhancing the predictive accuracy, fairness,
robustness, and explainability of Natural Language Processing (NLP) models by capturing …

Aligning large multi-modal model with robust instruction tuning

F Liu, K Lin, L Li, J Wang, Y Yacoob, L Wang - arXiv preprint arXiv …, 2023 - arxiv.org
Despite the promising progress in multi-modal tasks, current large multi-modal models
(LMMs) are prone to hallucinating inconsistent descriptions with respect to the associated …

HallusionBench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

T Guan, F Liu, X Wu, R Xian, Z Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce "HallusionBench," a comprehensive benchmark designed for the evaluation of
image-context reasoning. This benchmark presents significant challenges to advanced large …

Mitigating hallucination in large multi-modal models via robust instruction tuning

F Liu, K Lin, L Li, J Wang, Y Yacoob… - The Twelfth International …, 2023 - researchgate.net
Despite the promising progress in multi-modal tasks, current large multi-modal models
(LMMs) are prone to hallucinating inconsistent descriptions with respect to the associated …

Detecting and grounding multi-modal media manipulation

R Shao, T Wu, Z Liu - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
Misinformation has become a pressing issue. Fake media, in both visual and textual forms,
is widespread on the web. While various deepfake detection and text fake news detection …

UniIR: Training and benchmarking universal multimodal information retrievers

C Wei, Y Chen, H Chen, H Hu, G Zhang, J Fu… - … on Computer Vision, 2025 - Springer
Existing information retrieval (IR) models often assume a homogeneous format, limiting their
applicability to diverse user needs, such as searching for images with text descriptions …

MMC: Advancing multimodal chart understanding with large-scale instruction tuning

F Liu, X Wang, W Yao, J Chen, K Song, S Cho… - arXiv preprint arXiv …, 2023 - arxiv.org
With the rapid development of large language models (LLMs) and their integration into large
multimodal models (LMMs), there has been impressive progress in zero-shot completion of …

Open-domain visual entity recognition: Towards recognizing millions of wikipedia entities

H Hu, Y Luan, Y Chen, U Khandelwal… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large-scale multi-modal pre-training models such as CLIP and PaLI exhibit strong
generalization on various visual domains and tasks. However, existing image classification …