UNITER: Universal image-text representation learning

YC Chen, L Li, L Yu, A El Kholy, F Ahmed… - European conference on …, 2020 - Springer
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks,
where multimodal inputs are simultaneously processed for joint visual and textual …

UNITER: Learning universal image-text representations

YC Chen, L Li, L Yu, A El Kholy, F Ahmed, Z Gan… - 2019 - openreview.net
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks,
where multimodal inputs are jointly processed for visual and textual understanding. In this …
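As a rough illustration of the joint image-text embedding described in the two UNITER entries above (a sketch only, not the authors' released code; the hidden size, layer count, and region-feature dimension are assumptions), a single-stream encoder concatenates projected image-region features with text token embeddings and runs one shared Transformer over the joint sequence:

```python
# Sketch of a single-stream image-text encoder (assumed dimensions, not UNITER's code).
import torch
import torch.nn as nn

class SingleStreamEncoder(nn.Module):
    def __init__(self, vocab_size=30522, region_dim=2048, hidden=768, layers=6, heads=12):
        super().__init__()
        self.txt_embed = nn.Embedding(vocab_size, hidden)    # text token embeddings
        self.img_proj = nn.Linear(region_dim, hidden)         # project detector region features to hidden size
        layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)   # one Transformer shared by both modalities

    def forward(self, token_ids, region_feats):
        txt = self.txt_embed(token_ids)                        # (B, Lt, H)
        img = self.img_proj(region_feats)                      # (B, Lv, H)
        joint = torch.cat([txt, img], dim=1)                   # one sequence spanning both modalities
        return self.encoder(joint)                             # jointly contextualized representations
```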

UFO: A unified transformer for vision-language representation learning

J Wang, X Hu, Z Gan, Z Yang, X Dai, Z Liu, Y Lu… - arXiv preprint arXiv …, 2021 - arxiv.org
In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of
processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the …
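A minimal sketch of the modality-flexible idea in this snippet (an assumption about the general approach, not UFO's implementation): one encoder whose weights are reused whether the input is image-only, text-only, or both:

```python
# Sketch: one shared Transformer encoder for unimodal or multimodal input (assumed setup).
import torch
import torch.nn as nn

class UnifiedEncoder(nn.Module):
    def __init__(self, hidden=768, layers=6, heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, txt_tokens=None, img_tokens=None):
        # txt_tokens / img_tokens: (B, L, hidden) embeddings; either may be omitted.
        parts = [t for t in (txt_tokens, img_tokens) if t is not None]
        return self.encoder(torch.cat(parts, dim=1))           # same weights for every input mode
```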

Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training

G Li, N Duan, Y Fang, M Gong, D Jiang - Proceedings of the AAAI …, 2020 - aaai.org
We propose Unicoder-VL, a universal encoder that aims to learn joint representations of
vision and language in a pre-training manner. Borrowing ideas from cross-lingual pre-trained …

UC2: Universal cross-lingual cross-modal vision-and-language pre-training

M Zhou, L Zhou, S Wang, Y Cheng… - Proceedings of the …, 2021 - openaccess.thecvf.com
Vision-and-language pre-training has achieved impressive success in learning multimodal
representations between vision and language. To generalize this success to non-English …

ImageBERT: Cross-modal pre-training with large-scale weak-supervised image-text data

D Qi, L Su, J Song, E Cui, T Bharti, A Sacheti - arXiv preprint arXiv …, 2020 - arxiv.org
In this paper, we introduce ImageBERT, a new vision-language pre-trained model for
image-text joint embedding. The model is Transformer-based and takes different …

Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers

Z Huang, Z Zeng, B Liu, D Fu, J Fu - arXiv preprint arXiv:2004.00849, 2020 - arxiv.org
We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers
that jointly learn visual and language embeddings in a unified end-to-end framework. We aim …
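To make the pixel-level alignment in this snippet concrete (a sketch under assumed choices of backbone and dimensions, not Pixel-BERT's code): take a CNN feature map, flatten its spatial positions into visual tokens, and feed them with the text tokens into one Transformer:

```python
# Sketch of pixel/grid visual tokens fed jointly with text (assumed ResNet-50 backbone).
import torch
import torch.nn as nn
from torchvision.models import resnet50

class PixelTextEncoder(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, layers=6, heads=12):
        super().__init__()
        backbone = resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # keep conv stages, drop pool/fc
        self.img_proj = nn.Linear(2048, hidden)                     # project each spatial position
        self.txt_embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, images, token_ids):
        fmap = self.cnn(images)                                      # (B, 2048, h, w) feature map
        pix = fmap.flatten(2).transpose(1, 2)                        # (B, h*w, 2048) pixel tokens
        joint = torch.cat([self.txt_embed(token_ids), self.img_proj(pix)], dim=1)
        return self.encoder(joint)
```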

ROSITA: Enhancing vision-and-language semantic alignments via cross- and intra-modal knowledge integration

Y Cui, Z Yu, C Wang, Z Zhao, J Zhang… - Proceedings of the 29th …, 2021 - dl.acm.org
Vision-and-language pretraining (VLP) aims to learn generic multimodal representations
from massive image-text pairs. While various successful attempts have been proposed …

Vision-language pre-training with triple contrastive learning

J Yang, J Duan, S Tran, Y Xu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Vision-language representation learning largely benefits from image-text alignment through
contrastive losses (e.g., InfoNCE loss). The success of this alignment strategy is attributed to …
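Since this snippet names the InfoNCE loss, a compact sketch of the image-text contrastive objective it refers to (the temperature value and normalization choices here are assumptions):

```python
# Sketch of a symmetric image-text InfoNCE loss (assumed temperature value).
import torch
import torch.nn.functional as F

def info_nce(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)                         # (B, D) unit-norm image embeddings
    txt = F.normalize(txt_emb, dim=-1)                         # (B, D) unit-norm text embeddings
    logits = img @ txt.t() / temperature                       # (B, B) image-to-caption similarities
    targets = torch.arange(img.size(0), device=img.device)     # matched pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)                # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)            # text-to-image direction
    return (loss_i2t + loss_t2i) / 2
```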

Coarse-to-fine vision-language pre-training with fusion in the backbone

ZY Dou, A Kamath, Z Gan, P Zhang… - Advances in neural …, 2022 - proceedings.neurips.cc
Vision-language (VL) pre-training has recently received considerable attention.
However, most existing end-to-end pre-training approaches either only aim to tackle VL …