Democratizing contrastive language-image pre-training: A CLIP benchmark of data, model, and...

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer

With the urgent demand for generalized deep models, many pre-trained big models are
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …

Cited by 182 Related articles All 8 versions

[PDF] thecvf.com

MaskCLIP: Masked self-distillation advances contrastive language-image pretraining

X Dong, J Bao, Y Zheng, T Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com

This paper presents a simple yet effective framework MaskCLIP, which incorporates a newly
proposed masked self-distillation into contrastive language-image pretraining. The core idea …

Cited by 138 Related articles All 10 versions

[PDF] thecvf.com

ALIP: Adaptive language-image pre-training with synthetic caption

K Yang, J Deng, X An, J Li, Z Feng… - Proceedings of the …, 2023 - openaccess.thecvf.com

Contrastive Language-Image Pre-training (CLIP) has significantly boosted the
performance of various vision-language tasks by scaling up the dataset with image-text pairs …

Cited by 45 Related articles All 6 versions

[PDF] thecvf.com

Learning visual representations via language-guided sampling

M El Banani, K Desai… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Although an object may appear in numerous contexts, we often describe it in a limited
number of ways. Language allows us to abstract away visual variation to represent and …

Cited by 32 Related articles All 7 versions

[PDF] neurips.cc

PyramidCLIP: Hierarchical feature alignment for vision-language model pretraining

Y Gao, J Liu, Z Xu, J Zhang, K Li… - Advances in neural …, 2022 - proceedings.neurips.cc

Large-scale vision-language pre-training has achieved promising results on downstream
tasks. Existing methods highly rely on the assumption that the image-text pairs crawled from …

Cited by 97 Related articles All 5 versions

[PDF] thecvf.com

MobileCLIP: Fast image-text models through multi-modal reinforced training

PKA Vasu, H Pouransari, F Faghri… - Proceedings of the …, 2024 - openaccess.thecvf.com

Contrastive pre-training of image-text foundation models such as CLIP demonstrated
excellent zero-shot performance and improved robustness on a wide range of downstream …

Cited by 29 Related articles All 2 versions

[PDF] thecvf.com

The unreasonable effectiveness of CLIP features for image captioning: an experimental analysis

M Barraco, M Cornia, S Cascianelli… - Proceedings of the …, 2022 - openaccess.thecvf.com

Generating textual descriptions from visual inputs is a fundamental step towards machine
intelligence, as it entails modeling the connections between the visual and textual …

Cited by 83 Related articles All 6 versions

[PDF] thecvf.com

Perceptual grouping in contrastive vision-language models

K Ranasinghe, B McKinzie, S Ravi… - Proceedings of the …, 2023 - openaccess.thecvf.com

Recent advances in zero-shot image recognition suggest that vision-language models learn
generic visual representations with a high degree of semantic information that may be …

Cited by 47 Related articles All 6 versions

[PDF] thecvf.com

Revisiting multimodal representation in contrastive learning: from patch and token embeddings to finite discrete tokens

Y Chen, J Yuan, Y Tian, S Geng, X Li… - Proceedings of the …, 2023 - openaccess.thecvf.com

Contrastive learning-based vision-language pre-training approaches, such as CLIP, have
demonstrated great success in many vision-language tasks. These methods achieve cross …

Cited by 38 Related articles All 7 versions

[PDF] arxiv.org

HiCLIP: Contrastive language-image pretraining with hierarchy-aware attention

S Geng, J Yuan, Y Tian, Y Chen, Y Zhang - arXiv preprint arXiv …, 2023 - arxiv.org

The success of large-scale contrastive vision-language pretraining (CLIP) has benefited
both visual recognition and multimodal content understanding. The concise design brings …

Cited by 47 Related articles All 3 versions
