Pseudo-Q: Generating pseudo language queries for visual grounding

Y Liu, Y Zhang, Y Wang, F Hou, J Yuan… - … on Neural Networks …, 2023 - ieeexplore.ieee.org

Transformer, an attention-based encoder–decoder model, has already revolutionized the
field of natural language processing (NLP). Inspired by such significant achievements, some …

Cited by 314 · Related articles · All 22 versions

[PDF] thecvf.com

What does CLIP know about a red circle? Visual prompt engineering for VLMs

A Shtedritski, C Rupprecht… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Abstract Large-scale Vision-Language Models, such as CLIP, learn powerful image-text
representations that have found numerous applications, from zero-shot classification to text …

Cited by 68 · Related articles · All 7 versions

[HTML] sciencedirect.com

[HTML] CPT: Colorful prompt tuning for pre-trained vision-language models

Y Yao, A Zhang, Z Zhang, Z Liu, TS Chua, M Sun - AI Open, 2024 - Elsevier

Abstract Vision-Language Pre-training (VLP) models have shown promising capabilities in
grounding natural language in image data, facilitating a broad range of cross-modal tasks …

Cited by 217 · Related articles · All 4 versions

[PDF] thecvf.com

From images to textual prompts: Zero-shot visual question answering with frozen large language models

J Guo, J Li, D Li, AMH Tiong, B Li… - Proceedings of the …, 2023 - openaccess.thecvf.com

Large language models (LLMs) have demonstrated excellent zero-shot generalization to
new language tasks. However, effective utilization of LLMs for zero-shot visual question …

Cited by 76 · Related articles · All 5 versions

[PDF] thecvf.com

Not all features matter: Enhancing few-shot CLIP with adaptive prior refinement

X Zhu, R Zhang, B He, A Zhou… - Proceedings of the …, 2023 - openaccess.thecvf.com

Abstract The popularity of Contrastive Language-Image Pre-training (CLIP) has propelled its
application to diverse downstream vision tasks. To improve its capacity on downstream …

Cited by 38 · Related articles · All 5 versions

[PDF] neurips.cc

REVIVE: Regional visual representation matters in knowledge-based visual question answering

Y Lin, Y Xie, D Chen, Y Xu, C Zhu… - Advances in Neural …, 2022 - proceedings.neurips.cc

This paper revisits visual representation in knowledge-based visual question answering
(VQA) and demonstrates that using regional information in a better way can significantly …

Cited by 78 · Related articles · All 7 versions

[PDF] arxiv.org

RSVG: Exploring data and models for visual grounding on remote sensing data

Y Zhan, Z Xiong, Y Yuan - IEEE Transactions on Geoscience …, 2023 - ieeexplore.ieee.org

In this article, we introduce the task of visual grounding for remote sensing data (RSVG).
RSVG aims to localize the referred objects in remote sensing (RS) images with the guidance …

Cited by 53 · Related articles · All 3 versions

[PDF] thecvf.com

Referring image segmentation using text supervision

F Liu, Y Liu, Y Kong, K Xu, L Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com

Abstract Existing Referring Image Segmentation (RIS) methods typically require expensive
pixel-level or box-level annotations for supervision. In this paper, we observe that the …

Cited by 11 · Related articles · All 8 versions

[PDF] arxiv.org

From images to textual prompts: Zero-shot VQA with frozen large language models

J Guo, J Li, D Li, AMH Tiong, B Li, D Tao… - arXiv preprint arXiv …, 2022 - arxiv.org

Large language models (LLMs) have demonstrated excellent zero-shot generalization to
new language tasks. However, effective utilization of LLMs for zero-shot visual question …

Cited by 41 · Related articles · All 3 versions

[PDF] arxiv.org

Cross-modal adapter for text-video retrieval

H Jiang, J Zhang, R Huang, C Ge, Z Ni, J Lu… - arXiv preprint arXiv …, 2022 - arxiv.org

Text-video retrieval is an important multi-modal learning task, where the goal is to retrieve
the most relevant video for a given text query. Recently, pre-trained models, e.g., CLIP, show …

Cited by 31 · Related articles · All 2 versions
