T-MARS: Improving visual representations by circumventing text feature learning

P Maini, S Goyal, ZC Lipton, JZ Kolter… - arXiv preprint arXiv …, 2023 - arxiv.org
Large web-sourced multimodal datasets have powered a slew of new methods for learning
general-purpose visual representations, advancing the state of the art in computer vision …

LAION-5B: An open large-scale dataset for training next generation image-text models

C Schuhmann, R Beaumont, R Vencu… - Advances in …, 2022 - proceedings.neurips.cc
Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of
training on large amounts of noisy image-text data, without relying on expensive accurate …

MAP: Multimodal uncertainty-aware vision-language pre-training model

Y Ji, J Wang, Y Gong, L Zhang, Y Zhu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Multimodal semantic understanding often has to deal with uncertainty, which means the
obtained messages tend to refer to multiple targets. Such uncertainty is problematic for our …

ERNIE-ViL 2.0: Multi-view contrastive learning for image-text pre-training

B Shan, W Yin, Y Sun, H Tian, H Wu… - arXiv preprint arXiv …, 2022 - arxiv.org
Recent Vision-Language Pre-trained (VLP) models based on dual encoder have attracted
extensive attention from academia and industry due to their superior performance on various …

HiCLIP: Contrastive language-image pretraining with hierarchy-aware attention

S Geng, J Yuan, Y Tian, Y Chen, Y Zhang - arXiv preprint arXiv …, 2023 - arxiv.org
The success of large-scale contrastive vision-language pretraining (CLIP) has benefited
both visual recognition and multimodal content understanding. The concise design brings …

Exploring visual interpretability for contrastive language-image pre-training

Y Li, H Wang, Y Duan, H Xu, X Li - arXiv preprint arXiv:2209.07046, 2022 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) learns rich representations via readily
available supervision of natural language. It improves the performance of downstream vision …
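For context on the objective that several of the CLIP-style papers listed here build on, the sketch below shows a generic symmetric image-text contrastive (InfoNCE) loss. It is an illustrative, minimal version under the assumption of a dual-encoder setup with matched image-text pairs in a batch; the function name, arguments, and temperature value are hypothetical and not taken from any of the cited works.

```python
# Minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) objective.
# Illustrative only; names and defaults are assumptions, not from the cited papers.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) outputs of the two encoders,
    where row i of each tensor comes from the same image-text pair."""
    # L2-normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature

    # Matched pairs lie on the diagonal of the similarity matrix.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```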

Seeing what you miss: Vision-language pre-training with semantic completion learning

Y Ji, R Tu, J Jiang, W Kong, C Cai… - Proceedings of the …, 2023 - openaccess.thecvf.com
Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn
the correct corresponding information across different modalities. For this purpose, inspired …

Demystifying CLIP data

H Xu, S Xie, XE Tan, PY Huang, R Howes… - arXiv preprint arXiv …, 2023 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced
research and applications in computer vision, fueling modern recognition systems and …

Cross-lingual cross-modal pretraining for multimodal retrieval

H Fei, T Yu, P Li - Proceedings of the 2021 Conference of the …, 2021 - aclanthology.org
Recent pretrained vision-language models have achieved impressive performance on cross-
modal retrieval tasks in English. Their success, however, heavily depends on the availability …

EVA-CLIP-18B: Scaling CLIP to 18 billion parameters

Q Sun, J Wang, Q Yu, Y Cui, F Zhang, X Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Scaling up contrastive language-image pretraining (CLIP) is critical for empowering both
vision and multimodal models. We present EVA-CLIP-18B, the largest and most powerful …