Vision-language models for vision tasks: A survey

J Zhang, J Huang, S Jin, S Lu - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
Most visual recognition studies rely heavily on crowd-labelled data for deep neural network
(DNN) training, and they usually train a DNN for each single visual recognition task …

Test-time prompt tuning for zero-shot generalization in vision-language models

M Shu, W Nie, DA Huang, Z Yu… - Advances in …, 2022 - proceedings.neurips.cc
Pre-trained vision-language models (e.g., CLIP) have shown promising zero-shot
generalization in many downstream tasks with properly designed text prompts. Instead of …

Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners

R Zhang, X Hu, B Li, S Huang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Visual recognition in low-data regimes requires deep neural networks to learn generalized
representations from limited training samples. Recently, CLIP-based methods have shown …

SuS-X: Training-free name-only transfer of vision-language models

V Udandarao, A Gupta… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Contrastive Language-Image Pre-training (CLIP) has emerged as a simple yet
effective way to train large-scale vision-language models. CLIP demonstrates impressive …

Not all features matter: Enhancing few-shot CLIP with adaptive prior refinement

X Zhu, R Zhang, B He, A Zhou… - Proceedings of the …, 2023 - openaccess.thecvf.com
The popularity of Contrastive Language-Image Pre-training (CLIP) has propelled its
application to diverse downstream vision tasks. To improve its capacity on downstream …

CALIP: Zero-shot enhancement of CLIP with parameter-free attention

Z Guo, R Zhang, L Qiu, X Ma, X Miao, X He… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Contrastive Language-Image Pre-training (CLIP) has been shown to learn visual
representations with promising zero-shot performance. To further improve its downstream …

Binding touch to everything: Learning unified multimodal tactile representations

F Yang, C Feng, Z Chen, H Park… - Proceedings of the …, 2024 - openaccess.thecvf.com
The ability to associate touch with other modalities has huge implications for humans and
computational systems. However, multimodal learning with touch remains challenging due to …

ViewRefer: Grasp the multi-view knowledge for 3D visual grounding

Z Guo, Y Tang, R Zhang, D Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Understanding 3D scenes from multi-view inputs has been proven to alleviate the view
discrepancy issue in 3D visual grounding. However, existing methods normally neglect the …

Prompt learning with optimal transport for vision-language models

G Chen, W Yao, X Song, X Li, Y Rao, K Zhang - 2022 - openreview.net
With increasing attention on large vision-language models such as CLIP, significant effort
has been dedicated to building efficient prompts. Unlike conventional …

Tem-adapter: Adapting image-text pretraining for video question answer

G Chen, X Liu, G Wang, K Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video-language pre-trained models have shown remarkable success in guiding video
question-answering (VideoQA) tasks. However, due to the length of video sequences …