Pre-trained vision-language models (e.g., CLIP) have shown promising zero-shot generalization in many downstream tasks with properly designed text prompts. Instead of …
Visual recognition in low-data regimes requires deep neural networks to learn generalized representations from limited training samples. Recently, CLIP-based methods have shown …
V Udandarao, A Gupta… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Contrastive Language-Image Pre-training (CLIP) has emerged as a simple yet effective way to train large-scale vision-language models. CLIP demonstrates impressive …
The popularity of Contrastive Language-Image Pre-training (CLIP) has propelled its application to diverse downstream vision tasks. To improve its capacity on downstream …
Contrastive Language-Image Pre-training (CLIP) has been shown to learn visual representations with promising zero-shot performance. To further improve its downstream …
F Yang, C Feng, Z Chen, H Park… - Proceedings of the …, 2024 - openaccess.thecvf.com
The ability to associate touch with other modalities has huge implications for humans and computational systems. However, multimodal learning with touch remains challenging due to …
Z Guo, Y Tang, R Zhang, D Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Understanding 3D scenes from multi-view inputs has been shown to alleviate the view discrepancy issue in 3D visual grounding. However, existing methods normally neglect the …
With the increasing attention to large vision-language models such as CLIP, there has been a significant amount of effort dedicated to building efficient prompts. Unlike conventional …
G Chen, X Liu, G Wang, K Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video-language pre-trained models have shown remarkable success in guiding video question-answering (VideoQA) tasks. However, due to the length of video sequences …