Foundation Models Defining a New Era in Vision: a Survey and Outlook

M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

SAM-CLIP: Merging vision foundation models towards semantic and spatial understanding

H Wang, PKA Vasu, F Faghri… - Proceedings of the …, 2024 - openaccess.thecvf.com
The landscape of publicly available vision foundation models (VFMs) such as CLIP and
SAM is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their …

MobileCLIP: Fast image-text models through multi-modal reinforced training

PKA Vasu, H Pouransari, F Faghri… - Proceedings of the …, 2024 - openaccess.thecvf.com
Contrastive pre-training of image-text foundation models such as CLIP demonstrated
excellent zero-shot performance and improved robustness on a wide range of downstream …
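
For context on the objective this line of work builds on, here is a minimal sketch of a CLIP-style symmetric image-text contrastive loss. The function name, temperature value, and random stand-in features are illustrative assumptions; this shows the generic contrastive objective, not MobileCLIP's multi-modal reinforced training recipe.

    # Minimal sketch of a CLIP-style symmetric contrastive loss (illustrative only;
    # names and the temperature value are assumptions, not MobileCLIP's recipe).
    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
        # L2-normalize both embedding sets so cosine similarity is a dot product.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        # Pairwise similarity logits for all image-text pairs in the batch.
        logits = image_emb @ text_emb.t() / temperature
        # The matched caption for image i sits at column i.
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric cross-entropy: classify the correct caption per image and vice versa.
        loss_i = F.cross_entropy(logits, targets)
        loss_t = F.cross_entropy(logits.t(), targets)
        return (loss_i + loss_t) / 2

    # Random features standing in for encoder outputs.
    loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))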

ViTamin: Designing Scalable Vision Models in the Vision-Language Era

J Chen, Q Yu, X Shen, A Yuille… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent breakthroughs in vision-language models (VLMs) start a new page in the vision
community. The VLMs provide stronger and more generalizable feature embeddings …

Data curation via joint example selection further accelerates multimodal learning

T Evans, N Parthasarathy, H Merzic… - arXiv preprint arXiv …, 2024 - arxiv.org
Data curation is an essential component of large-scale pretraining. In this work, we
demonstrate that jointly selecting batches of data is more effective for learning than selecting …
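
To make the distinction concrete, a toy sketch contrasting independent top-k example selection with joint (greedy) selection under a batch-level score. The synthetic scores, the pairwise interaction term, and the greedy procedure are assumptions for illustration only, not the paper's algorithm.

    # Toy contrast between independent and joint batch selection (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    pointwise = rng.normal(size=64)                    # per-example scores
    pair_bonus = rng.normal(scale=0.1, size=(64, 64))  # value of examples co-occurring in a batch

    def batch_score(idx):
        idx = list(idx)
        return pointwise[idx].sum() + pair_bonus[np.ix_(idx, idx)].sum()

    # Independent selection: take the top-k examples by their individual scores.
    independent = np.argsort(pointwise)[-8:]

    # Joint selection: greedily add the example that most improves the batch-level score.
    joint, remaining = [], set(range(64))
    for _ in range(8):
        best = max(remaining, key=lambda j: batch_score(joint + [j]))
        joint.append(best)
        remaining.remove(best)

    print(batch_score(independent), batch_score(joint))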

Rejuvenating image-GPT as strong visual representation learners

S Ren, Z Wang, H Zhu, J Xiao, A Yuille… - Forty-first International …, 2023 - openreview.net
This paper enhances image-GPT (iGPT), one of the pioneering works that introduce
autoregressive pretraining to predict the next pixels for visual representation learning. Two …
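
As a reminder of the underlying objective, a minimal sketch of next-pixel autoregressive prediction over a flattened image, with a bigram model standing in for iGPT's causal transformer. The vocabulary size, image size, and stand-in model are illustrative assumptions, not the paper's setup.

    # Next-pixel autoregressive objective, sketched with a bigram stand-in model.
    import torch
    import torch.nn as nn

    vocab, seq_len = 256, 8 * 8                       # quantized pixel values; flattened 8x8 image
    # Each position predicts the next pixel from the current one only,
    # so no future pixels leak into the prediction.
    model = nn.Sequential(nn.Embedding(vocab, 128), nn.Linear(128, vocab))

    pixels = torch.randint(0, vocab, (4, seq_len))    # a batch of flattened images as token ids
    inputs, targets = pixels[:, :-1], pixels[:, 1:]   # predict pixel t+1 from pixel t
    logits = model(inputs)                            # (batch, seq_len - 1, vocab)
    loss = nn.functional.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))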

MoDE: CLIP Data Experts via Clustering

J Ma, PY Huang, S Xie, SW Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
The success of contrastive language-image pretraining (CLIP) relies on the supervision from
the pairing between images and captions, which tends to be noisy in web-crawled data. We …
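
A toy sketch of the clustering-then-experts idea: partition noisy web data by caption embedding, train one expert per cluster, and weight the experts at inference by proximity to each cluster centroid. The random features, cluster count, placeholder training step, and softmax routing rule are assumptions for illustration, not MoDE's exact pipeline.

    # Data experts via clustering, sketched with random stand-in features.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    caption_emb = rng.normal(size=(10_000, 64))          # stand-in for caption embeddings

    # Partition the web-crawled pairs into more coherent conditions via clustering.
    km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(caption_emb)

    # Train one "data expert" per cluster on its own shard of image-text pairs.
    experts = {}
    for c in range(4):
        shard = np.flatnonzero(km.labels_ == c)
        experts[c] = f"CLIP expert trained on {shard.size} pairs"  # placeholder for real training

    # At inference, weight experts by how close the task's text embedding is to each centroid.
    task_emb = rng.normal(size=64)
    dist = np.linalg.norm(km.cluster_centers_ - task_emb, axis=1)
    weights = np.exp(-dist) / np.exp(-dist).sum()
    print(experts, weights.round(2))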

CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy

X Li, Z Wang, C Xie - arXiv preprint arXiv:2306.15658, 2023 - arxiv.org
The recent work CLIPA presents an inverse scaling law for CLIP training, whereby the larger
the image/text encoders used, the shorter the sequence length of image/text tokens that can …
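
The inverse scaling law concerns the trade-off between encoder size and token sequence length; below is a back-of-the-envelope sketch of why shorter image-token sequences cut training cost. The resolutions, patch size, and quadratic-attention approximation are illustrative assumptions, not figures from the paper.

    # Rough arithmetic: fewer image tokens means far fewer pairwise attention interactions.
    def image_tokens(resolution, patch=14):
        return (resolution // patch) ** 2

    for res in (224, 112, 84):
        n = image_tokens(res)
        print(f"{res}px -> {n} tokens, attention cost ~{n**2:,} pairwise interactions")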

DiG-IN: Diffusion Guidance for Investigating Networks - Uncovering Classifier Differences, Neuron Visualisations and Visual Counterfactual Explanations

M Augustin, Y Neuhaus, M Hein - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
While deep learning has led to huge progress in complex image classification tasks like
ImageNet, unexpected failure modes, e.g. via spurious features, call into question how reliably …

Revisiting Adversarial Training at Scale

Z Wang, X Li, H Zhu, C Xie - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com
The machine learning community has witnessed a drastic change in the training pipeline,
pivoted by those "foundation models" with unprecedented scales. However, the field of …
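
For reference, a minimal sketch of the standard PGD-style adversarial-training step that this paper revisits at foundation-model scale. The tiny linear model, epsilon, step size, and step count are illustrative assumptions, not the paper's large-scale recipe.

    # One PGD adversarial-training step on toy data (illustrative only).
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    x = torch.rand(16, 1, 28, 28)
    y = torch.randint(0, 10, (16,))
    eps, alpha, steps = 8 / 255, 2 / 255, 3

    # Inner maximization: find a perturbation within the epsilon ball that increases the loss.
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)

    # Outer minimization: update the model on the adversarial examples.
    opt.zero_grad()
    loss_fn(model((x + delta).detach().clamp(0, 1)), y).backward()
    opt.step()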