CLIPPO: Image-and-language understanding from pixels only

M Tschannen, B Mustafa… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Multimodal models are becoming increasingly effective, in part due to unified components,
such as the Transformer architecture. However, multimodal models still often consist of many …

12-in-1: Multi-task vision and language representation learning

J Lu, V Goswami, M Rohrbach… - Proceedings of the …, 2020 - openaccess.thecvf.com
Much of vision-and-language research focuses on a small but diverse set of independent
tasks and supporting datasets often studied in isolation; however, the visually-grounded …

UC2: Universal cross-lingual cross-modal vision-and-language pre-training

M Zhou, L Zhou, S Wang, Y Cheng… - Proceedings of the …, 2021 - openaccess.thecvf.com
Vision-and-language pre-training has achieved impressive success in learning multimodal
representations between vision and language. To generalize this success to non-English …

COOKIE: Contrastive cross-modal knowledge sharing pre-training for vision-language representation

K Wen, J Xia, Y Huang, L Li, J Xu… - Proceedings of the …, 2021 - openaccess.thecvf.com
There has been a recent surge of interest in cross-modal pre-training. However, existing
approaches pre-train a one-stream model to learn a joint vision-language representation …

ASIF: Coupled data turns unimodal models to multimodal without training

A Norelli, M Fumero, V Maiorca… - Advances in …, 2024 - proceedings.neurips.cc
CLIP proved that aligning visual and language spaces is key to solving many vision tasks
without explicit training, but required training image and text encoders from scratch on a huge …

Distilled dual-encoder model for vision-language understanding

Z Wang, W Wang, H Zhu, M Liu, B Qin, F Wei - arXiv preprint arXiv …, 2021 - arxiv.org
We propose a cross-modal attention distillation framework to train a dual-encoder model for
vision-language understanding tasks, such as visual reasoning and visual question …

MiniGPT-4: Enhancing vision-language understanding with advanced large language models

D Zhu, J Chen, X Shen, X Li, M Elhoseiny - arXiv preprint arXiv …, 2023 - arxiv.org
The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly
generating websites from handwritten text and identifying humorous elements within …

PubMedCLIP: How much does CLIP benefit visual question answering in the medical domain?

S Eslami, C Meinel, G De Melo - Findings of the Association for …, 2023 - aclanthology.org
Contrastive Language–Image Pre-training (CLIP) has shown remarkable success
in learning with cross-modal supervision from extensive amounts of image–text pairs …

VT-CLIP: Enhancing vision-language models with visual-guided texts

L Qiu, R Zhang, Z Guo, Z Zeng, Z Guo, Y Li… - arXiv preprint arXiv …, 2021 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for
its transferable visual representation learning. However, due to the semantic gap within …

LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs

C Schuhmann, R Vencu, R Beaumont… - arXiv preprint arXiv …, 2021 - arxiv.org
Multi-modal language-vision models trained on hundreds of millions of image-text pairs (e.g.
CLIP, DALL-E) have recently surged in popularity, showing remarkable capability to perform zero- or few-…
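Several of the entries above (CLIP, PubMedCLIP, VT-CLIP, LAION-400M) revolve around contrastive language-image pre-training. The following is a minimal, illustrative sketch (not taken from any of the cited papers) of the symmetric contrastive objective that style of pre-training typically uses: image and text embeddings from a batch of paired examples are normalized, compared in a shared space, and matched along the diagonal. The function name, dimensions, and temperature value are placeholder assumptions.

```python
# Minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss.
# Assumes image_emb[i] and text_emb[i] form a matching pair within the batch.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)         # unit-norm image embeddings
    text_emb = F.normalize(text_emb, dim=-1)           # unit-norm text embeddings
    logits = image_emb @ text_emb.t() / temperature    # (batch, batch) cosine similarities
    targets = torch.arange(logits.size(0))             # matching pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

if __name__ == "__main__":
    # Toy usage with random tensors standing in for encoder outputs.
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(clip_contrastive_loss(img, txt))
```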