A survey of visual transformers

Y Liu, Y Zhang, Y Wang, F Hou, J Yuan… - … on Neural Networks …, 2023 - ieeexplore.ieee.org
Transformer, an attention-based encoder–decoder model, has already revolutionized the
field of natural language processing (NLP). Inspired by such significant achievements, some …

What does CLIP know about a red circle? Visual prompt engineering for VLMs

A Shtedritski, C Rupprecht… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Large-scale Vision-Language Models, such as CLIP, learn powerful image-text
representations that have found numerous applications, from zero-shot classification to text …

CPT: Colorful prompt tuning for pre-trained vision-language models

Y Yao, A Zhang, Z Zhang, Z Liu, TS Chua, M Sun - AI Open, 2024 - Elsevier
Vision-Language Pre-training (VLP) models have shown promising capabilities in
grounding natural language in image data, facilitating a broad range of cross-modal tasks …

From images to textual prompts: Zero-shot visual question answering with frozen large language models

J Guo, J Li, D Li, AMH Tiong, B Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large language models (LLMs) have demonstrated excellent zero-shot generalization to
new language tasks. However, effective utilization of LLMs for zero-shot visual question …

Not all features matter: Enhancing few-shot CLIP with adaptive prior refinement

X Zhu, R Zhang, B He, A Zhou… - Proceedings of the …, 2023 - openaccess.thecvf.com
The popularity of Contrastive Language-Image Pre-training (CLIP) has propelled its
application to diverse downstream vision tasks. To improve its capacity on downstream …

REVIVE: Regional visual representation matters in knowledge-based visual question answering

Y Lin, Y Xie, D Chen, Y Xu, C Zhu… - Advances in Neural …, 2022 - proceedings.neurips.cc
This paper revisits visual representation in knowledge-based visual question answering
(VQA) and demonstrates that using regional information in a better way can significantly …

RSVG: Exploring data and models for visual grounding on remote sensing data

Y Zhan, Z Xiong, Y Yuan - IEEE Transactions on Geoscience …, 2023 - ieeexplore.ieee.org
In this article, we introduce the task of visual grounding for remote sensing data (RSVG).
RSVG aims to localize the referred objects in remote sensing (RS) images with the guidance …

Referring image segmentation using text supervision

F Liu, Y Liu, Y Kong, K Xu, L Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Existing Referring Image Segmentation (RIS) methods typically require expensive
pixel-level or box-level annotations for supervision. In this paper, we observe that the …

From images to textual prompts: Zero-shot VQA with frozen large language models

J Guo, J Li, D Li, AMH Tiong, B Li, D Tao… - arXiv preprint arXiv …, 2022 - arxiv.org
Large language models (LLMs) have demonstrated excellent zero-shot generalization to
new language tasks. However, effective utilization of LLMs for zero-shot visual question …

Cross-modal adapter for text-video retrieval

H Jiang, J Zhang, R Huang, C Ge, Z Ni, J Lu… - arXiv preprint arXiv …, 2022 - arxiv.org
Text-video retrieval is an important multi-modal learning task, where the goal is to retrieve
the most relevant video for a given text query. Recently, pre-trained models, e.g., CLIP, show …