UNITER: Universal image-text representation learning

YC Chen, L Li, L Yu, A El Kholy, F Ahmed… - European conference on …, 2020 - Springer
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks,
where multimodal inputs are simultaneously processed for joint visual and textual …

UNITER: Learning universal image-text representations

YC Chen, L Li, L Yu, A El Kholy, F Ahmed, Z Gan… - 2019 - openreview.net
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks,
where multimodal inputs are jointly processed for visual and textual understanding. In this …
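As a rough illustration of the joint image-text embedding described in the two UNITER entries above (a sketch only, not the authors' released code; the hidden size, layer count, and region-feature dimension are assumptions), a single-stream encoder concatenates projected image-region features with text token embeddings and runs one shared Transformer over the joint sequence:

```python
# Sketch of a single-stream image-text encoder (assumed dimensions, not UNITER's code).
import torch
import torch.nn as nn

class SingleStreamEncoder(nn.Module):
    def __init__(self, vocab_size=30522, region_dim=2048, hidden=768, layers=6, heads=12):
        super().__init__()
        self.txt_embed = nn.Embedding(vocab_size, hidden)    # text token embeddings
        self.img_proj = nn.Linear(region_dim, hidden)         # project detector region features to hidden size
        layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)   # one Transformer shared by both modalities

    def forward(self, token_ids, region_feats):
        txt = self.txt_embed(token_ids)                        # (B, Lt, H)
        img = self.img_proj(region_feats)                      # (B, Lv, H)
        joint = torch.cat([txt, img], dim=1)                   # one sequence spanning both modalities
        return self.encoder(joint)                             # jointly contextualized representations
```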

UFO: A unified transformer for vision-language representation learning

J Wang, X Hu, Z Gan, Z Yang, X Dai, Z Liu, Y Lu… - arXiv preprint arXiv …, 2021 - arxiv.org
In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of
processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the …
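A minimal sketch of the modality-flexible idea in this snippet (an assumption about the general approach, not UFO's implementation): one encoder whose weights are reused whether the input is image-only, text-only, or both:

```python
# Sketch: one shared Transformer encoder for unimodal or multimodal input (assumed setup).
import torch
import torch.nn as nn

class UnifiedEncoder(nn.Module):
    def __init__(self, hidden=768, layers=6, heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, txt_tokens=None, img_tokens=None):
        # txt_tokens / img_tokens: (B, L, hidden) embeddings; either may be omitted.
        parts = [t for t in (txt_tokens, img_tokens) if t is not None]
        return self.encoder(torch.cat(parts, dim=1))           # same weights for every input mode
```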

Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training

G Li, N Duan, Y Fang, M Gong, D Jiang - Proceedings of the AAAI …, 2020 - aaai.org
We propose Unicoder-VL, a universal encoder that aims to learn joint representations of
vision and language in a pre-training manner. Borrowing ideas from cross-lingual pre-trained …

UC2: Universal cross-lingual cross-modal vision-and-language pre-training

M Zhou, L Zhou, S Wang, Y Cheng… - Proceedings of the …, 2021 - openaccess.thecvf.com
Vision-and-language pre-training has achieved impressive success in learning multimodal
representations between vision and language. To generalize this success to non-English …

ImageBERT: Cross-modal pre-training with large-scale weak-supervised image-text data

D Qi, L Su, J Song, E Cui, T Bharti, A Sacheti - arXiv preprint arXiv …, 2020 - arxiv.org
In this paper, we introduce ImageBERT, a new vision-language pre-trained model for
image-text joint embedding. The model is Transformer-based and takes different …

Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers

Z Huang, Z Zeng, B Liu, D Fu, J Fu - arXiv preprint arXiv:2004.00849, 2020 - arxiv.org
We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers
that jointly learn visual and language embeddings in a unified end-to-end framework. We aim …
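To make the pixel-level alignment in this snippet concrete (a sketch under assumed choices of backbone and dimensions, not Pixel-BERT's code): take a CNN feature map, flatten its spatial positions into visual tokens, and feed them with the text tokens into one Transformer:

```python
# Sketch of pixel/grid visual tokens fed jointly with text (assumed ResNet-50 backbone).
import torch
import torch.nn as nn
from torchvision.models import resnet50

class PixelTextEncoder(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, layers=6, heads=12):
        super().__init__()
        backbone = resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # keep conv stages, drop pool/fc
        self.img_proj = nn.Linear(2048, hidden)                     # project each spatial position
        self.txt_embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, images, token_ids):
        fmap = self.cnn(images)                                      # (B, 2048, h, w) feature map
        pix = fmap.flatten(2).transpose(1, 2)                        # (B, h*w, 2048) pixel tokens
        joint = torch.cat([self.txt_embed(token_ids), self.img_proj(pix)], dim=1)
        return self.encoder(joint)
```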

ROSITA: Enhancing vision-and-language semantic alignments via cross- and intra-modal knowledge integration

Y Cui, Z Yu, C Wang, Z Zhao, J Zhang… - Proceedings of the 29th …, 2021 - dl.acm.org
Vision-and-language pretraining (VLP) aims to learn generic multimodal representations
from massive image-text pairs. While various successful attempts have been proposed …

Vision-language pre-training with triple contrastive learning

J Yang, J Duan, S Tran, Y Xu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Vision-language representation learning largely benefits from image-text alignment through
contrastive losses (e.g., InfoNCE loss). The success of this alignment strategy is attributed to …
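Since this snippet names the InfoNCE loss, a compact sketch of the image-text contrastive objective it refers to (the temperature value and normalization choices here are assumptions):

```python
# Sketch of a symmetric image-text InfoNCE loss (assumed temperature value).
import torch
import torch.nn.functional as F

def info_nce(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)                         # (B, D) unit-norm image embeddings
    txt = F.normalize(txt_emb, dim=-1)                         # (B, D) unit-norm text embeddings
    logits = img @ txt.t() / temperature                       # (B, B) image-to-caption similarities
    targets = torch.arange(img.size(0), device=img.device)     # matched pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)                # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)            # text-to-image direction
    return (loss_i2t + loss_t2i) / 2
```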

Coarse-to-fine vision-language pre-training with fusion in the backbone

ZY Dou, A Kamath, Z Gan, P Zhang… - Advances in neural …, 2022 - proceedings.neurips.cc
Vision-language (VL) pre-training has recently received considerable attention.
However, most existing end-to-end pre-training approaches either only aim to tackle VL …