Astounding results from Transformer models on natural language tasks have prompted the vision community to study their application to computer vision problems. Among their salient …
A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce BEiT-3, a general-purpose multimodal foundation model that achieves …
T Gupta, A Kembhavi - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
We present VISPROG, a neuro-symbolic approach to solving complex and compositional visual tasks given natural language instructions. VISPROG avoids the need for any task …
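To illustrate the neuro-symbolic recipe sketched in that abstract, the Python toy below executes a tiny visual program step by step: a language model would emit the program text, and a runtime dispatches each line to a vision module. The module names (FIND, COUNT), the program syntax, and the interpreter are all invented for this example; they are not VISPROG's actual modules or API.

    # Toy interpreter for a hypothetical visual program. A language model
    # would generate the program text; this runtime executes each step by
    # calling a named module. Everything here is a placeholder for illustration.

    def find(image, obj):      # stand-in detector: returns fake boxes
        return [f"box({obj})"]

    def count(boxes):          # counts detected boxes
        return len(boxes)

    MODULES = {"FIND": find, "COUNT": count}

    def run_program(program, image):
        env = {"IMAGE": image}
        for line in program.strip().splitlines():
            target, call = [s.strip() for s in line.split("=", 1)]
            name, rest = call.split("(", 1)
            # resolve each argument: prior result if known, else a literal
            args = [env.get(a.strip(), a.strip().strip("'"))
                    for a in rest.rstrip(")").split(",")]
            env[target] = MODULES[name](*args)
        return env

    # Example instruction: "How many dogs are in the image?"
    prog = """
    BOXES = FIND(IMAGE, 'dog')
    ANSWER = COUNT(BOXES)
    """
    print(run_program(prog, image="<pixels>")["ANSWER"])  # -> 1

The point of the toy is the decomposition: the hard perception steps live in reusable modules, so a new task needs only a new program, not new training.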
Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This …
Most visual recognition studies rely heavily on crowd-labelled data to train deep neural networks (DNNs), and they usually train a separate DNN for each visual recognition task …
J Li, D Li, C Xiong, S Hoi - International conference on …, 2022 - proceedings.mlr.press
Vision-Language Pre-training (VLP) has advanced performance on many vision-language tasks. However, most existing pre-trained models only excel in either …
Despite the success of large vision and language models (VLMs) in many downstream applications, it is unclear how well they encode the compositional relationships between …
Fine-tuning large pre-trained models on downstream tasks has recently been adopted across a variety of domains. However, it is costly to update the entire parameter set of large pre-trained …
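The cost of full fine-tuning motivates parameter-efficient alternatives, in which the pre-trained backbone is frozen and only a small set of new parameters is trained. The PyTorch sketch below shows this generic pattern; the model, dimensions, and data are placeholders and do not reflect the specific method of the cited work.

    # Generic parameter-efficient fine-tuning sketch (PyTorch): freeze the
    # pre-trained backbone and update only a small, newly added task head.
    import torch
    import torch.nn as nn

    backbone = nn.Sequential(nn.Linear(768, 768), nn.ReLU())  # stand-in for a large pre-trained model
    for p in backbone.parameters():
        p.requires_grad = False        # frozen: no gradients, no optimizer state

    head = nn.Linear(768, 10)          # the only trainable parameters

    optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
    x, y = torch.randn(4, 768), torch.randint(0, 10, (4,))

    with torch.no_grad():              # backbone runs in inference mode
        feats = backbone(x)
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(head(feats), y)
    loss.backward()
    optimizer.step()

Freezing the backbone shrinks both the gradient computation and the optimizer state to the size of the head, which is the practical saving these methods target.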
Vision-language representation learning largely benefits from image-text alignment through contrastive losses (e.g., the InfoNCE loss). The success of this alignment strategy is attributed to …
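For reference, a standard symmetric image-text InfoNCE objective over a batch of N pairs, with similarity s(v, t) between image embedding v and text embedding t and temperature \tau (a textbook formulation, not taken from the cited paper):

    \mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp(s(v_i,t_i)/\tau)}{\sum_{j=1}^{N}\exp(s(v_i,t_j)/\tau)} + \log\frac{\exp(s(v_i,t_i)/\tau)}{\sum_{j=1}^{N}\exp(s(v_j,t_i)/\tau)}\right]

The first term aligns each image with its paired text against all other texts in the batch; the second is the mirror-image text-to-image term.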