ImageBERT: Cross-modal pre-training with large-scale weak-supervised image-text data

D Qi, L Su, J Song, E Cui, T Bharti, A Sacheti - arXiv preprint arXiv …, 2020 - arxiv.org
In this paper, we introduce a new vision-language pre-trained model, ImageBERT, for
image-text joint embedding. Our model is Transformer-based and takes different …

UNITER: Universal image-text representation learning

YC Chen, L Li, L Yu, A El Kholy, F Ahmed… - European conference on …, 2020 - Springer
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks,
where multimodal inputs are simultaneously processed for joint visual and textual …

Image-text embedding learning via visual and textual semantic reasoning

K Li, Y Zhang, K Li, Y Li, Y Fu - IEEE transactions on pattern …, 2022 - ieeexplore.ieee.org
As a bridge between the language and vision domains, cross-modal retrieval between images
and texts has been a hot research topic in recent years. It remains challenging because the current …

Dual-path convolutional image-text embeddings with instance loss

Z Zheng, L Zheng, M Garrett, Y Yang, M Xu… - ACM Transactions on …, 2020 - dl.acm.org
Matching images and sentences demands a fine understanding of both modalities. In this
article, we propose a new system to discriminatively embed images and text into a shared …

Image as a foreign language: BEiT pretraining for vision and vision-language tasks

W Wang, H Bao, L Dong, J Bjorck… - Proceedings of the …, 2023 - openaccess.thecvf.com
A big convergence of language, vision, and multimodal pretraining is emerging. In this work,
we introduce a general-purpose multimodal foundation model BEiT-3, which achieves …

Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training

G Li, N Duan, Y Fang, M Gong, D Jiang - Proceedings of the AAAI …, 2020 - aaai.org
We propose Unicoder-VL, a universal encoder that learns joint representations of
vision and language via pre-training. Borrowing ideas from cross-lingual pre-trained …

LightningDOT: Pre-training visual-semantic embeddings for real-time image-text retrieval

S Sun, YC Chen, L Li, S Wang, Y Fang… - Proceedings of the 2021 …, 2021 - aclanthology.org
Multimodal pre-training has propelled great advancement in vision-and-language research.
These large-scale pre-trained models, although successful, suffer from slow …

ALIP: Adaptive language-image pre-training with synthetic caption

K Yang, J Deng, X An, J Li, Z Feng… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the
performance of various vision-language tasks by scaling up the dataset with image-text pairs …
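
The entries above converge on a common technique: embedding images and text into a shared space with separate encoders and training on a contrastive objective over matched pairs, as popularized by CLIP. As a rough illustration of that shared idea, here is a minimal PyTorch sketch of the symmetric contrastive (InfoNCE) loss; the function name, tensor shapes, and temperature value are illustrative assumptions, not code from any of the cited papers.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) outputs of two separate encoders;
    matching pairs share the same row index.
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the true pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

The same loss shape underlies dual-encoder retrieval models such as LightningDOT; the fusion-encoder models in this list (UNITER, Unicoder-VL, BEiT-3) instead process both modalities jointly in one Transformer, trading inference speed for cross-modal attention.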