Scaling language-image pre-training via masking

Y Li, H Fan, R Hu… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present Fast Language-Image Pre-training (FLIP), a simple and more efficient
method for training CLIP. Our method randomly masks out and removes a large portion of …
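The snippet describes FLIP's core idea: randomly mask and discard a large fraction of image patches so the image encoder only processes the rest, cutting per-step compute roughly in proportion to the masking ratio. A minimal sketch of such patch masking (the function name, shapes, and keep ratio are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def random_mask_patches(patches, keep_ratio=0.5, rng=None):
    """FLIP-style masking sketch: keep a random subset of image patches.

    patches: array of shape (num_patches, dim).
    Returns the kept patches (in original order) and their indices; only
    these would be fed to the image encoder, reducing encoding cost.
    """
    rng = rng or np.random.default_rng()
    n = patches.shape[0]
    keep = max(1, int(n * keep_ratio))          # how many patches survive
    idx = np.sort(rng.permutation(n)[:keep])    # random subset, order kept
    return patches[idx], idx
```

With `keep_ratio=0.25` the encoder sees only a quarter of the tokens, which is where the training speedup comes from.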

ALIP: Adaptive language-image pre-training with synthetic caption

K Yang, J Deng, X An, J Li, Z Feng… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the
performance of various vision-language tasks by scaling up the dataset with image-text pairs …

Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?

CE Wu, Y Tian, H Yu, H Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision-language models such as CLIP learn a generic text-image embedding from large-
scale training data. A vision-language model can be adapted to a new classification task …

Sigmoid loss for language image pre-training

X Zhai, B Mustafa, A Kolesnikov… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose a simple pairwise sigmoid loss for image-text pre-training. Unlike standard
contrastive learning with softmax normalization, the sigmoid loss operates solely on image …
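The snippet contrasts this pairwise sigmoid loss with softmax-normalized contrastive learning: each image-text pair becomes an independent binary classification (matched pairs positive, all other pairs negative), with no normalization over the batch. A minimal NumPy sketch of that idea, with the temperature and bias fixed as constants for illustration (in training they are learnable):

```python
import numpy as np

def sigmoid_pairwise_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss sketch for a batch of image-text pairs.

    Label is +1 for matched pairs (the diagonal of the similarity
    matrix) and -1 for every other pair; each entry contributes an
    independent log-sigmoid term, so no softmax over the batch is needed.
    """
    # L2-normalize embeddings, then form the full similarity matrix.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = t * img @ txt.T + b
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - 1.0   # +1 on the diagonal, -1 elsewhere
    # -log sigmoid(label * logit) == logaddexp(0, -label * logit)
    return np.sum(np.logaddexp(0.0, -labels * logits)) / n
```

Because every pair is scored independently, the loss decomposes over the batch, which is what makes it attractive for large-batch and distributed training.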

MixGen: A new multi-modal data augmentation

X Hao, Y Zhu, S Appalaraju, A Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Data augmentation is a necessity to enhance data efficiency in deep learning. For vision-
language pre-training, data is only augmented either for images or for text in previous works …
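The snippet notes that prior work augmented either images or text in isolation; MixGen instead generates new image-text pairs jointly. A rough sketch of the idea as commonly described (mixup-style image interpolation paired with text concatenation; the exact formulation here is an assumption, not the paper's code):

```python
import numpy as np

def mixgen(img_a, img_b, txt_a, txt_b, lam=0.5):
    """MixGen-style multimodal augmentation sketch.

    Interpolate two images linearly (as in mixup) and concatenate their
    captions, so the synthesized image-text pair stays semantically
    aligned with both sources. lam is the interpolation weight.
    """
    img = lam * img_a + (1.0 - lam) * img_b
    txt = txt_a + " " + txt_b
    return img, txt
```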

CiT: Curation in training for effective vision-language data

H Xu, S Xie, PY Huang, L Yu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large vision-language models are generally applicable to many downstream tasks, but
come at an exorbitant training cost that only large institutions can afford. This paper trades …

Misalign, contrast then distill: Rethinking misalignments in language-image pre-training

B Kim, Y Jo, J Kim, S Kim - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Contrastive Language-Image Pretraining has emerged as a prominent approach for
training vision and text encoders with uncurated image-text pairs from the web. To enhance …

Improving CLIP training with language rewrites

L Fan, D Krishnan, P Isola… - Advances in Neural …, 2024 - proceedings.neurips.cc
Contrastive Language-Image Pre-training (CLIP) stands as one of the most effective
and scalable methods for training transferable vision models using paired image and text …

Attentive mask CLIP

Y Yang, W Huang, Y Wei, H Peng… - Proceedings of the …, 2023 - openaccess.thecvf.com
In vision-language modeling, image token removal is an efficient augmentation technique to
reduce the cost of encoding image features. The CLIP-style models, however, have been …

Democratizing contrastive language-image pre-training: A clip benchmark of data, model, and supervision

Y Cui, L Zhao, F Liang, Y Li, J Shao - arXiv preprint arXiv:2203.05796, 2022 - arxiv.org
Contrastive Language-Image Pretraining (CLIP) has emerged as a novel paradigm to learn
visual models from language supervision. While researchers continue to push the frontier of …