Multi-modal representation learning with text-driven soft masks

J Park, B Han - Proceedings of the IEEE/CVF Conference …, 2023 - openaccess.thecvf.com
We propose a visual-linguistic representation learning approach within a self-supervised
learning framework by introducing a new operation, loss, and data augmentation strategy …

Masked vision and language modeling for multi-modal representation learning

G Kwon, Z Cai, A Ravichandran, E Bas… - arXiv preprint arXiv …, 2022 - arxiv.org
In this paper, we study how to use masked signal modeling in vision and language (V+L)
representation learning. Instead of developing masked language modeling (MLM) and …

Eva: Exploring the limits of masked visual representation learning at scale

Y Fang, W Wang, B Xie, Q Sun, L Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com
We launch EVA, a vision-centric foundation model to explore the limits of visual
representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained …

Mamo: masked multimodal modeling for fine-grained vision-language representation learning

Z Zhao, L Guo, X He, S Shao, Z Yuan, J Liu - arXiv preprint arXiv …, 2022 - arxiv.org
Multimodal representation learning has shown promising improvements on various vision-
language tasks. Most existing methods excel at building global-level alignment between …

Vision-language pre-training with triple contrastive learning

J Yang, J Duan, S Tran, Y Xu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Vision-language representation learning largely benefits from image-text alignment through
contrastive losses (e.g., InfoNCE loss). The success of this alignment strategy is attributed to …

SemVLP: Vision-language pre-training by aligning semantics at multiple levels

C Li, M Yan, H Xu, F Luo, W Wang, B Bi… - arXiv preprint arXiv …, 2021 - arxiv.org
Vision-language pre-training (VLP) on large-scale image-text pairs has recently witnessed
rapid progress for learning cross-modal representations. Existing pre-training methods …

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

X Li, X Yin, C Li, P Zhang, X Hu, L Zhang… - Computer Vision–ECCV …, 2020 - Springer
Large-scale pre-training methods of learning cross-modal representations on image-text
pairs are becoming popular for vision-language tasks. While existing methods simply …

Relaxing contrastiveness in multimodal representation learning

Z Lin, E Bas, KY Singh… - Proceedings of the …, 2023 - openaccess.thecvf.com
Multimodal representation learning for images with paired raw texts can improve the
usability and generality of the learned semantic concepts while significantly reducing …

Multimodal masked autoencoders learn transferable representations

X Geng, H Liu, L Lee, D Schuurmans, S Levine… - arXiv preprint arXiv …, 2022 - arxiv.org
Building scalable models to learn from diverse, multimodal data remains an open challenge.
For vision-language data, the dominant approaches are based on contrastive learning …

Improved baselines for vision-language pre-training

E Fini, P Astolfi, A Romero-Soriano, J Verbeek… - arXiv preprint arXiv …, 2023 - arxiv.org
Contrastive learning has emerged as an efficient framework to learn multimodal
representations. CLIP, a seminal work in this area, achieved impressive results by training …