A systematic survey of prompt engineering in large language models: Techniques and applications

P Sahoo, AK Singh, S Saha, V Jain, S Mondal… - arXiv preprint arXiv …, 2024 - arxiv.org
Prompt engineering has emerged as an indispensable technique for extending the
capabilities of large language models (LLMs) and vision-language models (VLMs). This …

Vision-language models for vision tasks: A survey

J Zhang, J Huang, S Jin, S Lu - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
Most visual recognition studies rely heavily on crowd-labelled data for deep neural network
(DNN) training, and they usually train a DNN for each individual visual recognition task …

Open-vocabulary semantic segmentation with mask-adapted CLIP

F Liang, B Wu, X Dai, K Li, Y Zhao… - Proceedings of the …, 2023 - openaccess.thecvf.com
Open-vocabulary semantic segmentation aims to segment an image into semantic regions
according to text descriptions, which may not have been seen during training. Recent two …

Visual prompt multi-modal tracking

J Zhu, S Lai, X Chen, D Wang… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Visible-modal object tracking gives rise to a series of downstream multi-modal tracking
tributaries. To inherit the powerful representations of the foundation model, a natural modus …

Vision transformer adapter for dense predictions

Z Chen, Y Duan, W Wang, J He, T Lu, J Dai… - arXiv preprint arXiv …, 2022 - arxiv.org
This work investigates a simple yet powerful adapter for Vision Transformer (ViT). Unlike
recent visual transformers that introduce vision-specific inductive biases into their …

Repurposing diffusion-based image generators for monocular depth estimation

B Ke, A Obukhov, S Huang, N Metzger… - Proceedings of the …, 2024 - openaccess.thecvf.com
Monocular depth estimation is a fundamental computer vision task. Recovering 3D depth
from a single image is geometrically ill-posed and requires scene understanding, so it is not …

SimDA: Simple diffusion adapter for efficient video generation

Z Xing, Q Dai, H Hu, Z Wu… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
The recent wave of AI-generated content has seen the rapid development and success
of Text-to-Image (T2I) technologies. By contrast, Text-to-Video (T2V) still falls short of …

CORA: Adapting CLIP for open-vocabulary detection with region prompting and anchor pre-matching

X Wu, F Zhu, R Zhao, H Li - … of the IEEE/CVF conference on …, 2023 - openaccess.thecvf.com
Open-vocabulary detection (OVD) is an object detection task that aims to detect objects
from novel categories beyond the base categories on which the detector is trained. Recent …

Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization

J Abdul Samadh, MH Gani, N Hussein… - Advances in …, 2023 - proceedings.neurips.cc
The promising zero-shot generalization of vision-language models such as CLIP has led to
their adoption via prompt learning for numerous downstream tasks. Previous works have …

Towards large-scale 3D representation learning with multi-dataset point prompt training

X Wu, Z Tian, X Wen, B Peng, X Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
The rapid advancement of deep learning models is often attributed to their ability to leverage
massive training data. In contrast, such a privilege has not yet fully benefited 3D deep learning …