ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data

D Qi, L Su, J Song, E Cui, T Bharti, A Sacheti - arXiv preprint arXiv …, 2020 - arxiv.org
In this paper, we introduce a new vision-language pre-trained model--ImageBERT--for
image-text joint embedding. Our model is a Transformer-based model, which takes different …
