Enhancing multimodal compositional reasoning of visual language models with generative negative mining

U Sahin, H Li, Q Khan, D Cremers… - Proceedings of the …, 2024 - openaccess.thecvf.com
Contemporary large-scale visual language models (VLMs) exhibit strong representation
capacities, making them ubiquitous for enhancing image and text understanding tasks …

Revisiting the role of language priors in vision-language models

Z Lin, X Chen, D Pathak, P Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Vision-language models (VLMs) are impactful in part because they can be applied to a
variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning. We …

Building Vision-Language Models on Solid Foundations with Masked Distillation

S Sameni, K Kafle, H Tan… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Recent advancements in Vision-Language Models (VLMs) have marked a
significant leap in bridging the gap between computer vision and natural language …

GeoMeter: Probing Depth and Height Perception of Large Visual-Language Models

S Azad, Y Jain, R Garg, YS Rawat, V Vineet - arXiv preprint arXiv …, 2024 - arxiv.org
Geometric understanding is crucial for navigating and interacting with our environment.
While large Vision Language Models (VLMs) demonstrate impressive capabilities …

Concept-Oriented Deep Learning with Large Language Models

DT Chang - arXiv preprint arXiv:2306.17089, 2023 - arxiv.org
Large Language Models (LLMs) have been successfully used in many natural-language
tasks and applications, including text generation and AI chatbots. They are also a promising …

VISLA Benchmark: Evaluating Embedding Sensitivity to Semantic and Lexical Alterations

SH Dumpala, A Jaiswal, C Sastry, E Milios… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite their remarkable successes, state-of-the-art language models face challenges in
grasping certain important semantic details. This paper introduces the VISLA (Variance and …

SUGARCREPE++ Dataset: Vision-Language Model Sensitivity to Semantic and Lexical Alterations

SH Dumpala, A Jaiswal, C Sastry, E Milios… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite their remarkable successes, state-of-the-art large language models (LLMs),
including vision-and-language models (VLMs) and unimodal language models (ULMs), fail …

BloomVQA: Assessing Hierarchical Multi-modal Comprehension

Y Gong, R Shrestha, J Claypoole, M Cogswell… - arXiv preprint arXiv …, 2023 - arxiv.org
We propose a novel VQA dataset, BloomVQA, to facilitate comprehensive evaluation of
large vision-language models on comprehension tasks. Unlike current benchmarks that …

If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions

R Esfandiarpoor, C Menghini, SH Bach - arXiv preprint arXiv:2403.16442, 2024 - arxiv.org
Recent works often assume that Vision-Language Model (VLM) representations are based
on visual attributes like shape. However, it is unclear to what extent VLMs prioritize this …

Zero-shot capabilities of visual language models with prompt engineering for images of animals

AT Ocampo, E Orenstein, K Young - I Can't Believe It's Not Better Workshop … - openreview.net
Visual Language Models have exhibited impressive performance on new tasks in a zero-shot
setting. Language queries enable these large models to classify or detect objects even …