A Survey on Multimodal Benchmarks: In the Era of Large AI Models

L Li, G Chen, H Shi, J Xiao, L Chen - arXiv preprint arXiv:2409.18142, 2024 - arxiv.org
The rapid evolution of Multimodal Large Language Models (MLLMs) has brought substantial
advancements in artificial intelligence, significantly enhancing the capability to understand …

How and where does CLIP process negation?

V Quantmeyer, P Mosteiro, A Gatt - arXiv preprint arXiv:2407.10488, 2024 - arxiv.org
Various benchmarks have been proposed to test linguistic understanding in pre-trained
vision & language (VL) models. Here we build on the existence task from the VALSE …

VLM-AD: End-to-end autonomous driving through vision-language model supervision

Y Xu, Y Hu, Z Zhang, GP Meyer, SK Mustikovela… - arXiv preprint arXiv …, 2024 - arxiv.org
Human drivers rely on commonsense reasoning to navigate diverse and dynamic real-world
scenarios. Existing end-to-end (E2E) autonomous driving (AD) models are typically …

Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives

T Nguyen, Y Bin, J Xiao, L Qu, Y Li, JZ Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Humans use multiple senses to comprehend the environment. Vision and language are two
of the most vital senses since they allow us to easily communicate our thoughts and …

CV-Probes: Studying the interplay of lexical and world knowledge in visually grounded verb understanding

I Beňová, M Gregor, A Gatt - arXiv preprint arXiv:2409.01389, 2024 - arxiv.org
This study investigates the ability of various vision-language (VL) models to ground context-
dependent and non-context-dependent verb phrases. To do that, we introduce the CV …