Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer
With the urgent demand for generalized deep models, many large pre-trained models have been
proposed, such as Bidirectional Encoder Representations from Transformers (BERT), Vision Transformer (ViT) …

MaskCLIP: Masked self-distillation advances contrastive language-image pretraining

X Dong, J Bao, Y Zheng, T Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
This paper presents MaskCLIP, a simple yet effective framework that incorporates a newly
proposed masked self-distillation into contrastive language-image pretraining. The core idea …
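
To make the core idea concrete, here is a minimal PyTorch sketch of masked self-distillation, assuming an EMA teacher that encodes the full image and a student that encodes a masked view; `student`, `teacher`, and the plain MSE target are illustrative stand-ins, not the paper's exact per-token objective, and the full method adds the usual CLIP contrastive loss on top.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def ema_update(teacher, student, m=0.999):
        # The teacher's weights track an exponential moving average of the student's.
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(m).add_(ps, alpha=1 - m)

    def masked_distill_loss(student, teacher, patches, mask_ratio=0.75):
        # patches: (B, N, D) patch embeddings for a batch of images.
        B, N, D = patches.shape
        keep = int(N * (1 - mask_ratio))
        idx = torch.rand(B, N, device=patches.device).argsort(dim=1)[:, :keep]
        visible = patches.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))
        with torch.no_grad():
            target = teacher(patches)    # teacher sees the full image
        pred = student(visible)          # student sees only the visible patches
        return F.mse_loss(pred, target)  # match the teacher's representation

Both encoders are assumed to return features of the same shape.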

ALIP: Adaptive language-image pre-training with synthetic caption

K Yang, J Deng, X An, J Li, Z Feng… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the
performance of various vision-language tasks by scaling up the dataset with image-text pairs …
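
A rough sketch of the dual-caption training signal described here, assuming PyTorch: the image is contrasted against both the raw web caption and the synthetic caption, with a weighting term standing in for whatever adaptive per-sample scheme ALIP actually uses.

    import torch
    import torch.nn.functional as F

    def clip_loss(img, txt, t=0.07):
        # Symmetric InfoNCE over a batch of matched image/text embeddings.
        img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
        logits = img @ txt.t() / t
        labels = torch.arange(len(img), device=img.device)
        return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

    def dual_caption_loss(img, raw_txt, syn_txt, alpha=0.5):
        # alpha is a fixed placeholder; the paper derives sample-level weights adaptively.
        return alpha * clip_loss(img, raw_txt) + (1 - alpha) * clip_loss(img, syn_txt)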

Learning visual representations via language-guided sampling

M El Banani, K Desai… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Although an object may appear in numerous contexts, we often describe it in a limited
number of ways. Language allows us to abstract away visual variation to represent and …
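
The sampling idea lends itself to a short sketch, assuming PyTorch: captions are embedded with any sentence encoder, each image is paired with the image whose caption is nearest, and the pair then serves as positives for purely visual contrastive learning. All names here are illustrative.

    import torch
    import torch.nn.functional as F

    def language_guided_pairs(cap_emb):
        # cap_emb: (N, D) sentence embeddings of each image's caption.
        sim = F.normalize(cap_emb, dim=-1) @ F.normalize(cap_emb, dim=-1).t()
        sim.fill_diagonal_(-float("inf"))   # exclude trivial self-pairs
        return sim.argmax(dim=1)            # index of each image's language-matched partner

    def visual_contrastive_loss(feat_a, feat_b, t=0.1):
        # Pull language-matched image pairs together, push other images apart.
        a, b = F.normalize(feat_a, dim=-1), F.normalize(feat_b, dim=-1)
        logits = a @ b.t() / t
        labels = torch.arange(len(a), device=a.device)
        return F.cross_entropy(logits, labels)

Usage would be `partner = language_guided_pairs(cap_emb)` followed by `visual_contrastive_loss(img_feat, img_feat[partner])`.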

PyramidCLIP: Hierarchical feature alignment for vision-language model pretraining

Y Gao, J Liu, Z Xu, J Zhang, K Li… - Advances in neural …, 2022 - proceedings.neurips.cc
Large-scale vision-language pre-training has achieved promising results on downstream
tasks. Existing methods rely heavily on the assumption that the image-text pairs crawled from …
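
The hierarchical alignment can be sketched as a sum of InfoNCE terms over feature pairs at several granularities (e.g. global image vs. caption summary, salient region vs. full caption); this is an assumption-laden simplification of PyramidCLIP's peer-level and cross-level alignment, not its exact objective.

    import torch
    import torch.nn.functional as F

    def info_nce(x, y, t=0.07):
        # Standard symmetric contrastive loss between two matched embedding sets.
        x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
        logits = x @ y.t() / t
        labels = torch.arange(len(x), device=x.device)
        return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

    def pyramid_style_loss(levels):
        # levels: list of (image_emb, text_emb) pairs at different semantic levels.
        return sum(info_nce(i, t) for i, t in levels) / len(levels)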

MobileCLIP: Fast image-text models through multi-modal reinforced training

PKA Vasu, H Pouransari, F Faghri… - Proceedings of the …, 2024 - openaccess.thecvf.com
Contrastive pre-training of image-text foundation models such as CLIP has demonstrated
excellent zero-shot performance and improved robustness on a wide range of downstream …
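
One way to picture the "reinforced" training is distillation against teacher embeddings precomputed and stored with the dataset, so no teacher model runs during training; the KL term below is a plausible sketch under that assumption, not MobileCLIP's verbatim loss.

    import torch
    import torch.nn.functional as F

    def distill_to_cached_teacher(s_img, s_txt, t_img, t_txt, tau=2.0):
        # s_*: student embeddings; t_*: teacher embeddings loaded from the
        # reinforced dataset rather than computed online.
        s = F.log_softmax(F.normalize(s_img, dim=-1) @ F.normalize(s_txt, dim=-1).t() / tau, dim=-1)
        t = F.softmax(F.normalize(t_img, dim=-1) @ F.normalize(t_txt, dim=-1).t() / tau, dim=-1)
        return F.kl_div(s, t, reduction="batchmean")  # match the teacher's image-text similarities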

The unreasonable effectiveness of CLIP features for image captioning: an experimental analysis

M Barraco, M Cornia, S Cascianelli… - Proceedings of the …, 2022 - openaccess.thecvf.com
Generating textual descriptions from visual inputs is a fundamental step towards machine
intelligence, as it entails modeling the connections between the visual and textual …
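
The recipe the paper studies, frozen CLIP visual features feeding a caption decoder, can be sketched as below; `clip_visual` is assumed to return patch-level features of width `d_model`, and the decoder depth and vocabulary size are arbitrary placeholder values.

    import torch
    import torch.nn as nn

    class ClipCaptioner(nn.Module):
        # A small autoregressive decoder cross-attends to frozen CLIP-style features.
        def __init__(self, clip_visual, d_model=512, vocab=30522):
            super().__init__()
            self.visual = clip_visual.eval()
            for p in self.visual.parameters():
                p.requires_grad = False        # keep the CLIP encoder frozen
            self.embed = nn.Embedding(vocab, d_model)
            layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=3)
            self.head = nn.Linear(d_model, vocab)

        def forward(self, images, tokens):
            with torch.no_grad():
                memory = self.visual(images)   # (B, N, d_model) patch features assumed
            mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1)).to(tokens.device)
            return self.head(self.decoder(self.embed(tokens), memory, tgt_mask=mask))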

Perceptual grouping in contrastive vision-language models

K Ranasinghe, B McKinzie, S Ravi… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recent advances in zero-shot image recognition suggest that vision-language models learn
generic visual representations with a high degree of semantic information that may be …
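
A quick way to probe that semantic information at the patch level, assuming per-patch embeddings are available, is to score each patch against a text prompt; similarity maps of this kind are a common way to visualize grouping and localization in such models, though this is not the paper's specific method.

    import torch
    import torch.nn.functional as F

    def text_to_patch_heatmap(patch_feats, text_feat):
        # patch_feats: (N, D) per-patch image embeddings; text_feat: (D,) prompt embedding.
        sim = F.normalize(patch_feats, dim=-1) @ F.normalize(text_feat, dim=0)
        side = int(sim.numel() ** 0.5)   # assume a square patch grid
        return sim.view(side, side)      # crude segmentation-style heatmap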

Revisiting multimodal representation in contrastive learning: from patch and token embeddings to finite discrete tokens

Y Chen, J Yuan, Y Tian, S Geng, X Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive learning-based vision-language pre-training approaches, such as CLIP, have
demonstrated great success in many vision-language tasks. These methods achieve cross …
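
The finite-discrete-token idea can be sketched as both modalities re-expressing themselves as attention weights over one shared, learned codebook; aligning the resulting code-usage vectors (e.g. contrastively) then couples the modalities through the same discrete vocabulary. Sizes and the mean pooling below are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SharedCodebook(nn.Module):
        # Re-expresses patch or token embeddings as weights over finite shared codes.
        def __init__(self, num_codes=1024, dim=512):
            super().__init__()
            self.codes = nn.Parameter(torch.randn(num_codes, dim))

        def forward(self, feats):
            # feats: (B, L, dim) patch embeddings (image) or token embeddings (text).
            attn = F.softmax(feats @ self.codes.t() / feats.size(-1) ** 0.5, dim=-1)
            return attn.mean(dim=1)   # (B, num_codes) code usage per sample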

HiCLIP: Contrastive language-image pretraining with hierarchy-aware attention

S Geng, J Yuan, Y Tian, Y Chen, Y Zhang - arXiv preprint arXiv …, 2023 - arxiv.org
The success of large-scale contrastive vision-language pretraining (CLIP) has benefited
both visual recognition and multimodal content understanding. The concise design brings …
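
A very loose sketch of hierarchy-aware attention, in the spirit of the Tree-Transformer-style constituent priors that such designs build on: adjacent-token affinities define contiguous spans, and attention logits are biased so tokens attend mostly within their span. The paper's actual formulation differs in detail.

    import torch
    import torch.nn.functional as F

    def hierarchy_aware_attention(q, k, v, group_prob):
        # q, k, v: (B, L, D); group_prob: (B, L-1) affinity between adjacent tokens.
        B, L, D = q.shape
        logp = torch.log(group_prob.clamp_min(1e-6))
        c = F.pad(logp.cumsum(dim=1), (1, 0))             # (B, L) prefix sums of log-affinity
        prior = -(c.unsqueeze(2) - c.unsqueeze(1)).abs()  # log-prior for each token pair (i, j)
        attn = F.softmax(q @ k.transpose(1, 2) / D ** 0.5 + prior, dim=-1)
        return attn @ v   # pairs separated by weak-affinity boundaries are downweighted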