Foundation Models Defining a New Era in Vision: A Survey and Outlook

M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

Sigmoid loss for language image pre-training

X Zhai, B Mustafa, A Kolesnikov… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose a simple pairwise sigmoid loss for image-text pre-training. Unlike standard
contrastive learning with softmax normalization, the sigmoid loss operates solely on image …
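As a rough illustration of the mechanism named in this snippet, the sketch below implements a pairwise sigmoid image-text loss in PyTorch. It is an approximation in the spirit of the paper, not the authors' reference implementation; the function name, the log-parameterized temperature, and the scalar initializations are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pairwise_sigmoid_loss(img_emb: torch.Tensor,
                          txt_emb: torch.Tensor,
                          log_temperature: torch.Tensor,
                          bias: torch.Tensor) -> torch.Tensor:
    """Pairwise sigmoid image-text loss (SigLIP-style sketch).

    img_emb, txt_emb: (N, D) embeddings of N matched image-text pairs.
    log_temperature, bias: learnable scalars (names are illustrative).
    """
    # L2-normalize so the dot product is a cosine similarity.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    # All-pairs similarity logits, scaled and shifted by the learnable scalars.
    logits = log_temperature.exp() * img_emb @ txt_emb.t() + bias

    # +1 for matching (diagonal) pairs, -1 for every other pair.
    n = img_emb.size(0)
    labels = 2.0 * torch.eye(n, device=img_emb.device) - 1.0

    # Each pair is an independent binary decision; no softmax over the batch.
    # Normalizing by N (not N^2) follows the convention described for SigLIP.
    return -F.logsigmoid(labels * logits).sum() / n


# Minimal usage example with random embeddings standing in for encoder outputs.
if __name__ == "__main__":
    torch.manual_seed(0)
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    log_t = torch.tensor(2.3, requires_grad=True)   # exp(2.3) ~ 10 (assumed init)
    b = torch.tensor(-10.0, requires_grad=True)     # negative bias init (assumed)
    loss = pairwise_sigmoid_loss(img, txt, log_t, b)
    loss.backward()
    print(float(loss))
```

Because each image-text pair contributes its own sigmoid term, the loss does not require batch-wide normalization, which is what the snippet contrasts with softmax-based contrastive learning.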

ViTamin: Designing Scalable Vision Models in the Vision-Language Era

J Chen, Q Yu, X Shen, A Yuille… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent breakthroughs in vision-language models (VLMs) open a new chapter for the vision
community. VLMs provide stronger and more generalizable feature embeddings …

Mind the modality gap: Towards a remote sensing vision-language model via cross-modal alignment

A Zavras, D Michail, B Demir, I Papoutsis - arXiv preprint arXiv:2402.09816, 2024 - arxiv.org
Deep Learning (DL) is undergoing a paradigm shift with the emergence of foundation
models, aptly named for their crucial, yet incomplete, nature. In this work, we focus on …

Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion

P Allgeuer, K Ahrens, S Wermter - arXiv preprint arXiv:2407.11211, 2024 - arxiv.org
We introduce NOVIC, an innovative uNconstrained Open Vocabulary Image Classifier that
uses an autoregressive transformer to generatively output classification labels as language …
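For contrast with the unconstrained setting described above, the sketch below shows the standard prompt-based CLIP zero-shot baseline, in which classification is constrained to a predefined label set. This is not the NOVIC method; the model checkpoint, prompt template, label names, and image path are placeholder assumptions, using the Hugging Face transformers API.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Publicly available CLIP checkpoint (choice of checkpoint is an assumption).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# The "constrained" setting: the label set must be fixed in advance.
class_names = ["dog", "cat", "tractor"]
prompts = [f"a photo of a {c}" for c in class_names]

image = Image.open("example.jpg")  # placeholder image path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Cosine-similarity logits between the image and each text prompt.
probs = outputs.logits_per_image.softmax(dim=-1)
print(class_names[probs.argmax().item()])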

Zooming out on zooming in: Advancing super-resolution for remote sensing

P Wolters, F Bastani, A Kembhavi - arXiv preprint arXiv:2311.18082, 2023 - arxiv.org
Super-resolution for remote sensing has the potential for a huge impact on planet monitoring
by producing accurate and realistic high-resolution imagery on a frequent basis and a global …

Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Representation Learning

S Schrodi, DT Hoffmann, M Argus, V Fischer… - arXiv preprint arXiv …, 2024 - arxiv.org
Contrastive vision-language models like CLIP have gained popularity for their versatile
learned representations, which are applicable to a wide range of downstream tasks. Despite their successes in …
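As a concrete handle on the "modality gap" discussed in this snippet, the sketch below computes one common, simple proxy: the Euclidean distance between the centroids of L2-normalized image and text embeddings. This is an illustrative measurement only, not the specific analysis carried out in the paper.

```python
import torch
import torch.nn.functional as F

def modality_gap(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> float:
    """Distance between the centroids of normalized image and text embeddings,
    a simple proxy for how far apart the two modalities sit in embedding space."""
    img_c = F.normalize(img_emb, dim=-1).mean(dim=0)
    txt_c = F.normalize(txt_emb, dim=-1).mean(dim=0)
    return (img_c - txt_c).norm().item()

# Example with random features standing in for CLIP image/text outputs.
gap = modality_gap(torch.randn(1000, 512), torch.randn(1000, 512) + 0.5)
print(f"modality gap ~ {gap:.3f}")
```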

CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions

Y Liu, X Li, Z Wang, B Zhao, C Xie - arXiv preprint arXiv:2411.16828, 2024 - arxiv.org
Previous works show that noisy, web-crawled image-text pairs may limit vision-language
pretraining like CLIP and propose learning with synthetic captions as a promising …

Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models

Y Liu, T Ji, C Sun, Y Wu, A Zhou - arXiv preprint arXiv:2410.03176, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) have achieved impressive performance, yet
research has pointed out a serious issue of object hallucination in these models …

What If We Recaption Billions of Web Images with LLaMA-3?

X Li, H Tu, M Hui, Z Wang, B Zhao, J Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that
semantically aligning and enriching textual descriptions of these pairs can significantly …