Improving CLIP Training with Language Rewrites

L Fan, D Krishnan, P Isola… - Advances in Neural …, 2024 - proceedings.neurips.cc
Contrastive Language-Image Pre-training (CLIP) stands as one of the most effective and scalable methods for training transferable vision models using paired image and text …
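
As a concrete reference for the pre-training objective these papers build on, here is a minimal sketch of CLIP's symmetric contrastive (InfoNCE) loss. The random tensors stand in for image/text encoder outputs, and the temperature value is illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine-similarity logits.

    image_emb, text_emb: (batch, dim) embeddings of paired images and
    captions; row i of each tensor comes from the same image-text pair.
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the true pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast each image against all texts, and each text against all images.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random features standing in for encoder outputs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt).item())
```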

An Inverse Scaling Law for CLIP Training

X Li, Z Wang, C Xie - Advances in Neural Information …, 2024 - proceedings.neurips.cc
CLIP, one of the pioneering foundation models that connect images and text, has enabled
many recent breakthroughs in computer vision. However, its associated training cost is …

A Closer Look at the Robustness of Contrastive Language-Image Pre-training (CLIP)

W Tu, W Deng, T Gedeon - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Contrastive Language-Image Pre-training (CLIP) models have demonstrated remarkable generalization capabilities across multiple challenging distribution shifts …

SoftCLIP: Softer Cross-Modal Alignment Makes CLIP Stronger

Y Gao, J Liu, Z Xu, T Wu, E Zhang, K Li… - Proceedings of the …, 2024 - ojs.aaai.org
Over the past two years, vision-language pre-training has achieved noteworthy success on several downstream tasks. Nevertheless, acquiring high-quality image-text pairs …
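
The paper's exact soft-target construction is not reproduced here; the sketch below only illustrates the general idea of softening the one-hot contrastive targets, using intra-modal image self-similarity as an assumed relatedness signal.

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(image_emb, text_emb, temperature=0.07, alpha=0.2):
    """Cross-entropy against soft targets instead of strict one-hot labels."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.t() / temperature

    # Soft targets: blend the one-hot labels with a distribution derived
    # from image-image self-similarity (a stand-in relatedness signal).
    with torch.no_grad():
        self_sim = F.softmax(image_emb @ image_emb.t() / temperature, dim=-1)
        one_hot = torch.eye(logits.size(0), device=logits.device)
        targets = (1 - alpha) * one_hot + alpha * self_sim

    log_probs = F.log_softmax(logits, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()
```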

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

Y Cui, L Zhao, F Liang, Y Li, J Shao - arXiv preprint arXiv:2203.05796, 2022 - arxiv.org
Contrastive Language-Image Pretraining (CLIP) has emerged as a novel paradigm to learn
visual models from language supervision. While researchers continue to push the frontier of …

Unsupervised Prompt Learning for Vision-Language Models

T Huang, J Chu, F Wei - arXiv preprint arXiv:2204.03649, 2022 - arxiv.org
Contrastive vision-language models like CLIP have shown great progress in transfer learning. At the inference stage, a proper text description, also known as a prompt, needs to …
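
To make the role of the prompt concrete, here is a minimal zero-shot classification sketch using OpenAI's open-source clip package (github.com/openai/CLIP); the label set, prompt template, and image path are illustrative placeholders.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["dog", "cat", "car"]  # hypothetical label set
prompts = [f"a photo of a {c}" for c in class_names]

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize, then score the image against every class prompt.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```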

Attentive Mask CLIP

Y Yang, W Huang, Y Wei, H Peng… - Proceedings of the …, 2023 - openaccess.thecvf.com
In vision-language modeling, image token removal is an efficient augmentation technique to
reduce the cost of encoding image features. The CLIP-style models, however, have been …
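
A sketch of the token-removal idea follows; the per-token relevance scores are random stand-ins here, whereas the paper derives them from an auxiliary vision encoder's attention.

```python
import torch

def keep_top_tokens(patch_tokens, scores, keep_ratio=0.5):
    """patch_tokens: (batch, num_tokens, dim); scores: (batch, num_tokens)."""
    batch, num_tokens, dim = patch_tokens.shape
    k = max(1, int(num_tokens * keep_ratio))
    # Indices of the k highest-scoring tokens per image.
    top_idx = scores.topk(k, dim=1).indices  # (batch, k)
    # Gather the retained tokens; the rest are removed, not just zeroed,
    # so the encoder's sequence length (and cost) actually shrinks.
    idx = top_idx.unsqueeze(-1).expand(-1, -1, dim)
    return patch_tokens.gather(1, idx)

tokens = torch.randn(2, 196, 768)   # e.g. 14x14 ViT patch tokens
scores = torch.rand(2, 196)         # stand-in for attention-derived relevance
reduced = keep_top_tokens(tokens, scores, keep_ratio=0.5)
print(reduced.shape)  # torch.Size([2, 98, 768])
```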

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

Z Sun, Y Fang, T Wu, P Zhang, Y Zang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Contrastive Language-Image Pre-training (CLIP) plays an essential role in extracting valuable content information from images across diverse tasks. It aligns textual …
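
As an illustration of the general idea (not the paper's exact recipe), one way to condition the model on a region of interest is to widen a ViT patch-embedding stem from RGB to RGBA, zero-initializing the alpha filters so the widened model initially matches the RGB one.

```python
import torch
import torch.nn as nn

rgb_patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)  # stand-in ViT stem

# Widen to RGBA: copy the pretrained RGB filters and zero-init the alpha
# filters, so the alpha channel contributes nothing until it is trained.
rgba_patch_embed = nn.Conv2d(4, 768, kernel_size=16, stride=16)
with torch.no_grad():
    rgba_patch_embed.weight[:, :3] = rgb_patch_embed.weight
    rgba_patch_embed.weight[:, 3:].zero_()
    rgba_patch_embed.bias.copy_(rgb_patch_embed.bias)

image = torch.randn(1, 3, 224, 224)
alpha = torch.zeros(1, 1, 224, 224)
alpha[:, :, 64:160, 64:160] = 1.0  # hypothetical region of interest
tokens = rgba_patch_embed(torch.cat([image, alpha], dim=1))
print(tokens.shape)  # torch.Size([1, 768, 14, 14])
```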

Scaling Language-Image Pre-training via Masking

Y Li, H Fan, R Hu… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP. Our method randomly masks out and removes a large portion of …
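
A minimal sketch of the random masking step, assuming ViT-style patch tokens; the 0.75 mask ratio is one example of the "large portion" the abstract describes.

```python
import torch

def random_mask_tokens(patch_tokens, mask_ratio=0.75):
    """patch_tokens: (batch, num_tokens, dim); keeps a random
    (1 - mask_ratio) subset per image, removing the rest from the sequence
    so each training step encodes far fewer tokens."""
    batch, num_tokens, dim = patch_tokens.shape
    num_keep = max(1, int(num_tokens * (1 - mask_ratio)))
    # Random permutation per image; keep the first num_keep positions.
    noise = torch.rand(batch, num_tokens, device=patch_tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :num_keep]  # (batch, num_keep)
    idx = keep_idx.unsqueeze(-1).expand(-1, -1, dim)
    return patch_tokens.gather(1, idx)

tokens = torch.randn(4, 196, 768)
visible = random_mask_tokens(tokens, mask_ratio=0.75)
print(visible.shape)  # torch.Size([4, 49, 768])
```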

EVA-CLIP: Improved Training Techniques for CLIP at Scale

Q Sun, Y Fang, L Wu, X Wang, Y Cao - arXiv preprint arXiv:2303.15389, 2023 - arxiv.org
Contrastive language-image pre-training, CLIP for short, has gained increasing attention for
its potential in various scenarios. In this paper, we propose EVA-CLIP, a series of models …