Robust cross-modal representation learning with progressive self-distillation

A Andonian, S Chen, R Hamid - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
The learning objective of the vision-language approach of CLIP does not effectively account for
the noisy many-to-many correspondences found in web-harvested image captioning …
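For context, a minimal sketch of the standard one-positive-per-pair CLIP objective that this work relaxes. This is illustrative PyTorch, not the paper's implementation; the function name and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Treats the i-th caption as the only positive for the i-th image: the
    strict one-to-one assumption that progressive self-distillation softens
    for noisy web-harvested pairs.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```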

Pixel-bert: Aligning image pixels with text by deep multi-modal transformers

Z Huang, Z Zeng, B Liu, D Fu, J Fu - arXiv preprint arXiv:2004.00849, 2020 - arxiv.org
We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers
that jointly learn visual and language embeddings in a unified end-to-end framework. We aim …
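A minimal sketch of the joint-encoding idea: one transformer attending over concatenated visual and text tokens. This is an illustrative PyTorch module, not Pixel-BERT's actual architecture; all dimensions and names are assumptions.

```python
import torch
import torch.nn as nn

class JointEncoder(nn.Module):
    """One transformer attends over visual and text tokens together,
    so alignment is learned end to end rather than via separate towers."""

    def __init__(self, dim=256, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.modality = nn.Embedding(2, dim)  # 0 = visual token, 1 = text token

    def forward(self, vis_tokens, txt_tokens):      # (B, Nv, D), (B, Nt, D)
        vis = vis_tokens + self.modality.weight[0]  # tag visual positions
        txt = txt_tokens + self.modality.weight[1]  # tag text positions
        return self.encoder(torch.cat([vis, txt], dim=1))
```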

Zerocap: Zero-shot image-to-text generation for visual-semantic arithmetic

Y Tewel, Y Shalev, I Schwartz… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Recent text-to-image matching models apply contrastive learning to large corpora of
uncurated pairs of images and sentences. While such models can provide a powerful score …

Fusion of detected objects in text for visual question answering

C Alberti, J Ling, M Collins, D Reitter - arXiv preprint arXiv:1908.05054, 2019 - arxiv.org
To advance models of multimodal context, we introduce a simple yet powerful neural
architecture for data that combines vision and natural language. The "Bounding Boxes in …

Lightningdot: Pre-training visual-semantic embeddings for real-time image-text retrieval

S Sun, YC Chen, L Li, S Wang, Y Fang… - Proceedings of the 2021 …, 2021 - aclanthology.org
Multimodal pre-training has driven great advances in vision-and-language research.
These large-scale pre-trained models, although successful, fatally suffer from slow …
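The speed argument comes down to scoring with a single matrix multiply over cached embeddings rather than running cross-modal attention per query. A minimal sketch under that assumption; names and sizes are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(query_emb, index_emb, k=5):
    """Rank a precomputed image index against one text query by dot product.

    No cross-modal attention runs at query time, so scoring reduces to a
    single matrix multiply over the cached index.
    """
    scores = index_emb @ query_emb          # (N,) similarity scores
    topk = torch.topk(scores, k)
    return topk.indices, topk.values

# Usage: score 10k cached, unit-normalized image embeddings against one query.
index = F.normalize(torch.randn(10_000, 512), dim=-1)
query = F.normalize(torch.randn(512), dim=0)
ids, scores = retrieve_top_k(query, index)
```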

Scaling up visual and vision-language representation learning with noisy text supervision

C Jia, Y Yang, Y Xia, YT Chen… - International …, 2021 - proceedings.mlr.press
Pre-trained representations are becoming crucial for many NLP and perception tasks. While
representation learning in NLP has transitioned to training on raw text without human …

Grounding language models to images for multimodal inputs and outputs

JY Koh, R Salakhutdinov… - … Conference on Machine …, 2023 - proceedings.mlr.press
We propose an efficient method to ground pretrained text-only language models to the
visual domain, enabling them to process arbitrarily interleaved image-and-text data, and …
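A common recipe for this kind of grounding is a small learned projection from a frozen image encoder into the frozen LM's token-embedding space. The sketch below shows that general pattern, not the paper's exact method; every dimension and name is an assumption.

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    """Project frozen image features into the word-embedding space of a
    frozen language model, so image tokens can be interleaved with text."""

    def __init__(self, vis_dim=768, lm_dim=4096, n_tokens=4):
        super().__init__()
        self.proj = nn.Linear(vis_dim, lm_dim * n_tokens)  # only trained part
        self.n_tokens = n_tokens
        self.lm_dim = lm_dim

    def forward(self, vis_feats):                  # (B, vis_dim)
        out = self.proj(vis_feats)                 # (B, lm_dim * n_tokens)
        # Reshape into a short sequence of pseudo word embeddings that can be
        # concatenated with the LM's text embeddings.
        return out.view(-1, self.n_tokens, self.lm_dim)
```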

Efficientclip: Efficient cross-modal pre-training by ensemble confident learning and language modeling

J Wang, H Wang, J Deng, W Wu, D Zhang - arXiv preprint arXiv …, 2021 - arxiv.org
While large-scale pre-training has made great strides in bridging the gap
between vision and language, it still faces several challenges. First, the cost of pre-training …

Visually-augmented language modeling

W Wang, L Dong, H Cheng, H Song, X Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
Human language is grounded in multimodal knowledge, including visual knowledge such as
colors, sizes, and shapes. However, current large-scale pre-trained language models rely …

WenLan: Bridging vision and language by large-scale multi-modal pre-training

Y Huo, M Zhang, G Liu, H Lu, Y Gao, G Yang… - arXiv preprint arXiv …, 2021 - arxiv.org
Multi-modal pre-training models have been intensively explored to bridge vision and
language in recent years. However, most of them explicitly model the cross-modal …