Cross-lingual cross-modal pretraining for multimodal retrieval

H Fei, T Yu, P Li - Proceedings of the 2021 Conference of the …, 2021 - aclanthology.org
Recent pretrained vision-language models have achieved impressive performance on cross-
modal retrieval tasks in English. Their success, however, heavily depends on the availability …

UFO: A unified transformer for vision-language representation learning

J Wang, X Hu, Z Gan, Z Yang, X Dai, Z Liu, Y Lu… - arXiv preprint arXiv …, 2021 - arxiv.org
In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of
processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the …

Understanding transferable representation learning and zero-shot transfer in CLIP

Z Chen, Y Deng, Y Li, Q Gu - arXiv preprint arXiv:2310.00927, 2023 - arxiv.org
Multi-modal learning has become increasingly popular due to its ability to leverage
information from different data sources (e.g., text and images) to improve the model …

Decoupling the role of data, attention, and losses in multimodal transformers

LA Hendricks, J Mellor, R Schneider… - Transactions of the …, 2021 - direct.mit.edu
Recently, multimodal transformer models have gained popularity because their performance
on downstream tasks suggests they learn rich visual-linguistic representations. Focusing on …

Improving zero-shot generalization and robustness of multi-modal models

Y Ge, J Ren, A Gallagher, Y Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Multi-modal image-text models such as CLIP and LiT have demonstrated impressive
performance on image classification benchmarks and their zero-shot generalization ability is …

MiniGPT-4: Enhancing vision-language understanding with advanced large language models

D Zhu, J Chen, X Shen, X Li, M Elhoseiny - arXiv preprint arXiv …, 2023 - arxiv.org
The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly
generating websites from handwritten text and identifying humorous elements within …

Teaching CLIP to count to ten

R Paiss, A Ephrat, O Tov, S Zada… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large vision-language models, such as CLIP, learn robust representations of text and
images, facilitating advances in many downstream tasks, including zero-shot classification …

ALIP: Adaptive language-image pre-training with synthetic caption

K Yang, J Deng, X An, J Li, Z Feng… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the
performance of various vision-language tasks by scaling up the dataset with image-text pairs …

PaLI: A jointly-scaled multilingual language-image model

X Chen, X Wang, S Changpinyo… - arXiv preprint arXiv …, 2022 - arxiv.org
Effective scaling and a flexible task interface enable large language models to excel at many
tasks. We present PaLI (Pathways Language and Image model), a model that extends this …

RegionCLIP: Region-based language-image pretraining

Y Zhong, J Yang, P Zhang, C Li… - Proceedings of the …, 2022 - openaccess.thecvf.com
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved
impressive results on image classification in both zero-shot and transfer learning settings …