Scaling language-image pre-training via masking

Y Li, H Fan, R Hu… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present Fast Language-Image Pre-training (FLIP), a simple and more efficient
method for training CLIP. Our method randomly masks out and removes a large portion of …
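The snippet describes FLIP's core idea: randomly mask and discard a large fraction of image patches so the image encoder only processes the rest, cutting per-step compute roughly in proportion to the masking ratio. A minimal sketch of such patch masking (the function name, shapes, and keep ratio are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def random_mask_patches(patches, keep_ratio=0.5, rng=None):
    """FLIP-style masking sketch: keep a random subset of image patches.

    patches: array of shape (num_patches, dim).
    Returns the kept patches (in original order) and their indices; only
    these would be fed to the image encoder, reducing encoding cost.
    """
    rng = rng or np.random.default_rng()
    n = patches.shape[0]
    keep = max(1, int(n * keep_ratio))          # how many patches survive
    idx = np.sort(rng.permutation(n)[:keep])    # random subset, order kept
    return patches[idx], idx
```

With `keep_ratio=0.25` the encoder sees only a quarter of the tokens, which is where the training speedup comes from.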

ALIP: Adaptive language-image pre-training with synthetic caption

K Yang, J Deng, X An, J Li, Z Feng… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the
performance of various vision-language tasks by scaling up the dataset with image-text pairs …

Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?

CE Wu, Y Tian, H Yu, H Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision-language models such as CLIP learn a generic text-image embedding from large-
scale training data. A vision-language model can be adapted to a new classification task …

Sigmoid loss for language image pre-training

X Zhai, B Mustafa, A Kolesnikov… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose a simple pairwise sigmoid loss for image-text pre-training. Unlike standard
contrastive learning with softmax normalization, the sigmoid loss operates solely on image …
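The snippet contrasts this pairwise sigmoid loss with softmax-normalized contrastive learning: each image-text pair becomes an independent binary classification (matched pairs positive, all other pairs negative), with no normalization over the batch. A minimal NumPy sketch of that idea, with the temperature and bias fixed as constants for illustration (in training they are learnable):

```python
import numpy as np

def sigmoid_pairwise_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss sketch for a batch of image-text pairs.

    Label is +1 for matched pairs (the diagonal of the similarity
    matrix) and -1 for every other pair; each entry contributes an
    independent log-sigmoid term, so no softmax over the batch is needed.
    """
    # L2-normalize embeddings, then form the full similarity matrix.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = t * img @ txt.T + b
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - 1.0   # +1 on the diagonal, -1 elsewhere
    # -log sigmoid(label * logit) == logaddexp(0, -label * logit)
    return np.sum(np.logaddexp(0.0, -labels * logits)) / n
```

Because every pair is scored independently, the loss decomposes over the batch, which is what makes it attractive for large-batch and distributed training.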

MixGen: A new multi-modal data augmentation

X Hao, Y Zhu, S Appalaraju, A Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Data augmentation is a necessity to enhance data efficiency in deep learning. For vision-
language pre-training, data is only augmented either for images or for text in previous works …
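The snippet notes that prior work augmented either images or text in isolation; MixGen instead generates new image-text pairs jointly. A rough sketch of the idea as commonly described (mixup-style image interpolation paired with text concatenation; the exact formulation here is an assumption, not the paper's code):

```python
import numpy as np

def mixgen(img_a, img_b, txt_a, txt_b, lam=0.5):
    """MixGen-style multimodal augmentation sketch.

    Interpolate two images linearly (as in mixup) and concatenate their
    captions, so the synthesized image-text pair stays semantically
    aligned with both sources. lam is the interpolation weight.
    """
    img = lam * img_a + (1.0 - lam) * img_b
    txt = txt_a + " " + txt_b
    return img, txt
```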

CiT: Curation in training for effective vision-language data

H Xu, S Xie, PY Huang, L Yu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large vision-language models are generally applicable to many downstream tasks, but
come at an exorbitant training cost that only large institutions can afford. This paper trades …

Misalign, contrast then distill: Rethinking misalignments in language-image pre-training

B Kim, Y Jo, J Kim, S Kim - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Contrastive Language-Image Pretraining has emerged as a prominent approach for
training vision and text encoders with uncurated image-text pairs from the web. To enhance …

Improving CLIP training with language rewrites

L Fan, D Krishnan, P Isola… - Advances in Neural …, 2024 - proceedings.neurips.cc
Contrastive Language-Image Pre-training (CLIP) stands as one of the most effective
and scalable methods for training transferable vision models using paired image and text …

Attentive mask CLIP

Y Yang, W Huang, Y Wei, H Peng… - Proceedings of the …, 2023 - openaccess.thecvf.com
In vision-language modeling, image token removal is an efficient augmentation technique to
reduce the cost of encoding image features. The CLIP-style models, however, have been …

Democratizing contrastive language-image pre-training: A clip benchmark of data, model, and supervision

Y Cui, L Zhao, F Liang, Y Li, J Shao - arXiv preprint arXiv:2203.05796, 2022 - arxiv.org
Contrastive Language-Image Pretraining (CLIP) has emerged as a novel paradigm to learn
visual models from language supervision. While researchers continue to push the frontier of …