Multi-modal representation learning with text-driven soft masks

J Park, B Han - Proceedings of the IEEE/CVF Conference …, 2023 - openaccess.thecvf.com
We propose a visual-linguistic representation learning approach within a self-supervised
learning framework by introducing a new operation, loss, and data augmentation strategy …

Masked vision and language modeling for multi-modal representation learning

G Kwon, Z Cai, A Ravichandran, E Bas… - arXiv preprint arXiv …, 2022 - arxiv.org
In this paper, we study how to use masked signal modeling in vision and language (V+L)
representation learning. Instead of developing masked language modeling (MLM) and …

Eva: Exploring the limits of masked visual representation learning at scale

Y Fang, W Wang, B Xie, Q Sun, L Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com
We launch EVA, a vision-centric foundation model to explore the limits of visual
representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained …

Mamo: masked multimodal modeling for fine-grained vision-language representation learning

Z Zhao, L Guo, X He, S Shao, Z Yuan, J Liu - arXiv preprint arXiv …, 2022 - arxiv.org
Multimodal representation learning has shown promising improvements on various vision-
language tasks. Most existing methods excel at building global-level alignment between …

Vision-language pre-training with triple contrastive learning

J Yang, J Duan, S Tran, Y Xu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Vision-language representation learning largely benefits from image-text alignment through
contrastive losses (e.g., InfoNCE loss). The success of this alignment strategy is attributed to …

SemVLP: Vision-language pre-training by aligning semantics at multiple levels

C Li, M Yan, H Xu, F Luo, W Wang, B Bi… - arXiv preprint arXiv …, 2021 - arxiv.org
Vision-language pre-training (VLP) on large-scale image-text pairs has recently witnessed
rapid progress for learning cross-modal representations. Existing pre-training methods …

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

X Li, X Yin, C Li, P Zhang, X Hu, L Zhang… - Computer Vision–ECCV …, 2020 - Springer
Large-scale pre-training methods of learning cross-modal representations on image-text
pairs are becoming popular for vision-language tasks. While existing methods simply …

Relaxing contrastiveness in multimodal representation learning

Z Lin, E Bas, KY Singh… - Proceedings of the …, 2023 - openaccess.thecvf.com
Multimodal representation learning for images with paired raw texts can improve the
usability and generality of the learned semantic concepts while significantly reducing …

Multimodal masked autoencoders learn transferable representations

X Geng, H Liu, L Lee, D Schuurmans, S Levine… - arXiv preprint arXiv …, 2022 - arxiv.org
Building scalable models to learn from diverse, multimodal data remains an open challenge.
For vision-language data, the dominant approaches are based on contrastive learning …

Improved baselines for vision-language pre-training

E Fini, P Astolfi, A Romero-Soriano, J Verbeek… - arXiv preprint arXiv …, 2023 - arxiv.org
Contrastive learning has emerged as an efficient framework to learn multimodal
representations. CLIP, a seminal work in this area, achieved impressive results by training …