In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the …
Z Chen, Y Deng, Y Li, Q Gu - arXiv preprint arXiv:2310.00927, 2023 - arxiv.org
Multi-modal learning has become increasingly popular due to its ability to leverage information from different data sources (e.g., text and images) to improve the model …
Recently, multimodal transformer models have gained popularity because their performance on downstream tasks suggests they learn rich visual-linguistic representations. Focusing on …
Y Ge, J Ren, A Gallagher, Y Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks and their zero-shot generalization ability is …
The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within …
Large vision-language models, such as CLIP, learn robust representations of text and images, facilitating advances in many downstream tasks, including zero-shot classification …
K Yang, J Deng, X An, J Li, Z Feng… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks by scaling up the dataset with image-text pairs …
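The contrastive pre-training objective these snippets refer to can be summarized as a symmetric InfoNCE loss over a batch of matched image-text pairs. The following is a minimal NumPy sketch, not any project's actual implementation; the function name, the temperature value, and the toy embeddings are all illustrative assumptions.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (N, D) arrays where row i of each forms a matched
    image-text pair. Names and defaults are illustrative only.
    """
    # L2-normalize so the dot product is a cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix, scaled by the temperature.
    logits = img @ txt.T / temperature

    def xent(l):
        # Cross-entropy where the correct "class" for row i is column i,
        # i.e. the matched pair sits on the diagonal.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Intuitively, the loss is low when each image embedding is closer to its own caption's embedding than to every other caption in the batch, and vice versa.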
Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this …
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification in both zero-shot and transfer learning settings …
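The zero-shot classification setting mentioned throughout these excerpts works by comparing an image embedding against text embeddings of class prompts (e.g., "a photo of a {class}") and picking the most similar one. A minimal sketch, assuming the embeddings have already been produced by some pretrained image and text encoders (the arrays below are placeholders, not real model outputs):

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class prompt most similar to the image.

    image_emb: (D,) embedding of a single image.
    class_text_embs: (C, D) embeddings of C class prompts.
    Both are assumed to come from a pretrained encoder pair.
    """
    # Normalize so dot products are cosine similarities.
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1,
                                           keepdims=True)
    sims = txt @ img                 # one similarity score per class
    return int(np.argmax(sims)), sims
```

No task-specific training is involved: adding a new class only requires embedding one more text prompt, which is what makes this style of classification "zero-shot".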