When and why vision-language models behave like bags-of-words, and what to do about it?

M Yuksekgonul, F Bianchi, P Kalluri, D Jurafsky… - arXiv preprint arXiv …, 2022 - arxiv.org
Despite the success of large vision and language models (VLMs) in many downstream
applications, it is unclear how well they encode compositional information. Here, we create …

Aligning bag of regions for open-vocabulary object detection

S Wu, W Zhang, S Jin, W Liu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Pre-trained vision-language models (VLMs) learn to align vision and language
representations on large-scale datasets, where each image-text pair usually contains a bag …

CleanCLIP: Mitigating data poisoning attacks in multimodal contrastive learning

H Bansal, N Singhi, Y Yang, F Yin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Multimodal contrastive pretraining has been used to train multimodal representation models,
such as CLIP, on large amounts of paired image-text data. However, previous studies have …

Few-shot adaptation of multi-modal foundation models: A survey

F Liu, T Zhang, W Dai, C Zhang, W Cai, X Zhou… - Artificial Intelligence …, 2024 - Springer
Multi-modal (vision-language) models, such as CLIP, are replacing traditional
supervised pre-training models (e.g., ImageNet-based pre-training) as the new generation of …

Detecting and understanding harmful memes: A survey

S Sharma, F Alam, MS Akhtar, D Dimitrov… - arXiv preprint arXiv …, 2022 - arxiv.org
The automatic identification of harmful content online is of major concern for social media
platforms, policymakers, and society. Researchers have studied textual, visual, and audio …

CyCLIP: Cyclic contrastive language-image pretraining

S Goel, H Bansal, S Bhatia, R Rossi… - Advances in …, 2022 - proceedings.neurips.cc
Recent advances in contrastive representation learning over paired image-text data have
led to models such as CLIP that achieve state-of-the-art performance for zero-shot …

Revisiting the role of language priors in vision-language models

Z Lin, X Chen, D Pathak, P Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Vision-language models (VLMs) are impactful in part because they can be applied to a
variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning. We …

Investigating Compositional Challenges in Vision-Language Models for Visual Grounding

Y Zeng, Y Huang, J Zhang, Z Jie… - Proceedings of the …, 2024 - openaccess.thecvf.com
Pre-trained vision-language models (VLMs) have achieved high performance on various
downstream tasks and have been widely used for visual grounding in a weakly …

FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models

A Bulat, Y Ouali… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Despite noise and caption quality having been acknowledged as important factors impacting
vision-language contrastive pre-training, in this paper we show that the full potential of …

IDEA: Increasing text diversity via online multi-label recognition for vision-language pre-training

X Huang, Y Zhang, Y Cheng, W Tian, R Zhao… - Proceedings of the 30th …, 2022 - dl.acm.org
Vision-Language Pre-training (VLP) with large-scale image-text pairs has demonstrated
superior performance in various fields. However, the image-text pairs co-occurrent on the …