MAFA: Managing False Negatives for Vision-Language Pre-training

J Byun, D Kim, T Moon - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
We consider a critical issue of false negatives in Vision-Language Pre-training (VLP), a
challenge that arises from the inherent many-to-many correspondence of image-text pairs in …

Converting and Smoothing False Negatives for Vision-Language Pre-training

J Byun, D Kim, T Moon - arXiv preprint arXiv:2312.06112, 2023 - arxiv.org
We consider the critical issue of false negatives in Vision-Language Pre-training (VLP), a
challenge that arises from the inherent many-to-many correspondence of image-text pairs in …
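
Both entries above describe the same problem: in-batch contrastive learning treats every non-paired caption in a batch as a negative, even when it happens to describe the image. The sketch below illustrates one generic mitigation, masking suspiciously similar off-diagonal pairs out of the InfoNCE denominator; the threshold `tau_fn` and the masking rule are illustrative assumptions, not the papers' actual conversion-and-smoothing procedure.

```python
# Minimal sketch (not MAFA's method): a symmetric CLIP-style InfoNCE loss that
# drops likely false negatives -- off-diagonal pairs whose raw similarity
# exceeds an illustrative threshold tau_fn -- from the denominator.
import torch
import torch.nn.functional as F

def contrastive_loss_with_fn_mask(img_emb, txt_emb, temperature=0.07, tau_fn=0.9):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (B, B) scaled similarities
    sim = img_emb @ txt_emb.t()                    # unscaled, used for masking

    B = logits.size(0)
    diag = torch.eye(B, dtype=torch.bool, device=logits.device)
    # Off-diagonal pairs that look as similar as true pairs are plausible
    # false negatives under many-to-many correspondence; exclude them.
    fn_mask = (sim > tau_fn) & ~diag
    logits = logits.masked_fill(fn_mask, float('-inf'))

    targets = torch.arange(B, device=logits.device)
    # Symmetric image-to-text and text-to-image directions, as in CLIP-style VLP.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings:
loss = contrastive_loss_with_fn_mask(torch.randn(8, 256), torch.randn(8, 256))
```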

Leveraging per image-token consistency for vision-language pre-training

Y Gou, T Ko, H Yang, J Kwok… - Proceedings of the …, 2023 - openaccess.thecvf.com
Most existing vision-language pre-training (VLP) approaches adopt cross-modal masked
language modeling (CMLM) to learn vision-language associations. However, we find that …
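
For context, cross-modal masked language modeling (CMLM) masks caption tokens and predicts them conditioned on the image. A minimal sketch of that generic objective, assuming a toy transformer with illustrative sizes, a BERT-style vocabulary, and a 15% mask rate; this is the baseline the snippet refers to, not the paper's proposed alternative.

```python
# Generic CMLM sketch: image features and text tokens share one encoder, and
# the loss is computed only at masked text positions. All sizes illustrative.
import torch
import torch.nn as nn

class TinyCMLM(nn.Module):
    def __init__(self, vocab_size=30522, dim=256, mask_id=103):
        super().__init__()
        self.mask_id = mask_id                      # BERT-style [MASK] id
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.img_proj = nn.Linear(512, dim)         # project image features
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab_size)      # predict masked token ids

    def forward(self, token_ids, img_feats, mask_rate=0.15):
        labels = token_ids.clone()
        mask = torch.rand_like(token_ids, dtype=torch.float) < mask_rate
        labels[~mask] = -100                        # score masked positions only
        inp = token_ids.masked_fill(mask, self.mask_id)
        x = torch.cat([self.img_proj(img_feats), self.tok_emb(inp)], dim=1)
        x = self.encoder(x)[:, img_feats.size(1):]  # keep text positions
        logits = self.head(x)
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), labels.reshape(-1),
            ignore_index=-100)

# Toy usage: 2 captions of 10 tokens, 4 image-region features of size 512 each.
model = TinyCMLM()
loss = model(torch.randint(0, 30522, (2, 10)), torch.randn(2, 4, 512))
```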

Accelerating vision-language pretraining with free language modeling

T Wang, Y Ge, F Zheng, R Cheng… - Proceedings of the …, 2023 - openaccess.thecvf.com
The state of the art in vision-language pretraining (VLP) achieves exemplary performance
but suffers from high training costs resulting from slow convergence and long training time …

PyramidCLIP: Hierarchical feature alignment for vision-language model pretraining

Y Gao, J Liu, Z Xu, J Zhang, K Li… - Advances in neural …, 2022 - proceedings.neurips.cc
Large-scale vision-language pre-training has achieved promising results on downstream
tasks. Existing methods rely heavily on the assumption that the image-text pairs crawled from …

Filtering, distillation, and hard negatives for vision-language pre-training

F Radenovic, A Dubey, A Kadian… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision-language models trained with contrastive learning on large-scale noisy data are
becoming increasingly popular for zero-shot recognition problems. In this paper, we improve …
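
A common way to use hard negatives in this setting is to restrict each row of the InfoNCE denominator to the positive plus the k most similar non-matching texts. The sketch below shows that generic in-batch variant; the paper's full filtering and distillation pipeline is not reproduced, and `k` is an illustrative choice.

```python
# Generic in-batch hard-negative selection for image-text contrastive loss:
# keep the positive and only the top-k hardest (most similar) negatives.
import torch
import torch.nn.functional as F

def hard_negative_contrastive(img_emb, txt_emb, k=3, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    B = logits.size(0)
    diag = torch.eye(B, dtype=torch.bool, device=logits.device)

    # Rank off-diagonal entries; the highest-similarity non-matching texts
    # are the hard negatives that dominate the gradient.
    neg_logits = logits.masked_fill(diag, float('-inf'))
    topk = neg_logits.topk(k=min(k, B - 1), dim=1).indices
    keep = torch.zeros_like(diag).scatter_(1, topk, True) | diag
    logits = logits.masked_fill(~keep, float('-inf'))

    targets = torch.arange(B, device=logits.device)
    return F.cross_entropy(logits, targets)

loss = hard_negative_contrastive(torch.randn(8, 256), torch.randn(8, 256))
```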

SaCo Loss: Sample-wise Affinity Consistency for Vision-Language Pre-training

S Wu, H Tan, Z Tian, Y Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-language pre-training (VLP) aims to learn joint representations of vision and
language modalities. The contrastive paradigm is currently dominant in this field. However …

TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-training

C Jiang, W Ye, H Xu, Q Ye, M Yan, J Zhang… - Proceedings of the …, 2024 - ojs.aaai.org
Self-supervised Multi-modal Contrastive Learning (SMCL) remarkably advances
modern Vision-Language Pre-training (VLP) models by aligning visual and linguistic …
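
A generic mixup-style reading of image mixing for contrastive VLP: blend two images and split the contrastive target between their two captions in proportion to the mixing coefficient. TiMix's text-aware weighting is not reproduced here; `lam` is a plain Beta-sampled coefficient and `image_encoder` is a hypothetical stand-in.

```python
# Mixup-style image mixing for contrastive VLP (generic, not TiMix's exact
# formulation): a mixed image matches both source captions, weighted by lam.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mixed_contrastive_step(images, txt_emb, image_encoder, temperature=0.07):
    B = images.size(0)
    lam = torch.distributions.Beta(0.8, 0.8).sample().item()
    perm = torch.randperm(B)
    mixed = lam * images + (1 - lam) * images[perm]   # pixel-level mixup

    img_emb = F.normalize(image_encoder(mixed), dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature

    targets = torch.arange(B)
    # Soft target: each mixed image is positive for its two source captions.
    return lam * F.cross_entropy(logits, targets) + \
           (1 - lam) * F.cross_entropy(logits, targets[perm])

# Toy usage with a hypothetical linear "encoder" on 32x32 RGB images:
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))
loss = mixed_contrastive_step(torch.randn(4, 3, 32, 32), torch.randn(4, 256), encoder)
```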

VLMAE: Vision-language masked autoencoder

S He, T Guo, T Dai, R Qiao, C Wu, X Shu… - arXiv preprint arXiv …, 2022 - arxiv.org
Image and language modeling is of crucial importance for vision-language pre-training
(VLP), which aims to learn multi-modal representations from large-scale paired image-text …
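
For reference, the masked-autoencoder objective the title refers to: hide most image patches and reconstruct their pixels from the rest. The sketch below is image-only and replaces masked patches with a learned token (SimMIM-style) rather than dropping them as MAE does; VLMAE's joint vision-language design is not reproduced, and all sizes are illustrative.

```python
# Minimal masked-autoencoder sketch: mask 75% of patches, reconstruct pixels,
# and compute the loss on masked patches only.
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    def __init__(self, patch_dim=48, dim=128):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.Linear(dim, patch_dim)    # reconstruct raw pixels

    def forward(self, patches, mask_ratio=0.75):
        B, N, _ = patches.shape
        mask = torch.rand(B, N, device=patches.device) < mask_ratio
        x = self.embed(patches)
        # Replace masked patches with a learned token (SimMIM-style).
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, N, -1), x)
        recon = self.decoder(self.encoder(x))
        # Reconstruction loss on masked patches only.
        return ((recon - patches) ** 2)[mask].mean()

# Toy usage: 2 images, each as 64 flattened 4x4x3 patches.
loss = TinyMAE()(torch.randn(2, 64, 48))
```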

CMAL: A novel cross-modal associative learning framework for vision-language pre-training

Z Ma, J Li, G Li, K Huang - Proceedings of the 30th ACM International …, 2022 - dl.acm.org
With the flourishing of social media platforms, vision-language pre-training (VLP) has
recently received great attention, and remarkable progress has been made. The …