D Qi, L Su, J Song, E Cui, T Bharti, A Sacheti - arXiv preprint arXiv …, 2020 - arxiv.org
In this paper, we introduce ImageBERT, a new vision-language pre-trained model for
image-text joint embedding. The model is Transformer-based and takes different …