A systematic survey of prompt engineering in large language models: Techniques and applications

P Sahoo, AK Singh, S Saha, V Jain, S Mondal… - arXiv preprint arXiv …, 2024 - arxiv.org
Prompt engineering has emerged as an indispensable technique for extending the
capabilities of large language models (LLMs) and vision-language models (VLMs). This …

Vision-language models for vision tasks: A survey

J Zhang, J Huang, S Jin, S Lu - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
Most visual recognition studies rely heavily on crowd-labelled data for deep neural network
(DNN) training, and they usually train a DNN for each individual visual recognition task …

Open-vocabulary semantic segmentation with mask-adapted CLIP

F Liang, B Wu, X Dai, K Li, Y Zhao… - Proceedings of the …, 2023 - openaccess.thecvf.com
Open-vocabulary semantic segmentation aims to segment an image into semantic regions
according to text descriptions, which may not have been seen during training. Recent two …

Visual prompt multi-modal tracking

J Zhu, S Lai, X Chen, D Wang… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Visible-modal object tracking gives rise to a series of downstream multi-modal tracking
tributaries. To inherit the powerful representations of the foundation model, a natural modus …

Vision transformer adapter for dense predictions

Z Chen, Y Duan, W Wang, J He, T Lu, J Dai… - arXiv preprint arXiv …, 2022 - arxiv.org
This work investigates a simple yet powerful adapter for Vision Transformer (ViT). Unlike
recent visual transformers that introduce vision-specific inductive biases into their …

Repurposing diffusion-based image generators for monocular depth estimation

B Ke, A Obukhov, S Huang, N Metzger… - Proceedings of the …, 2024 - openaccess.thecvf.com
Monocular depth estimation is a fundamental computer vision task. Recovering 3D depth
from a single image is geometrically ill-posed and requires scene understanding, so it is not …

SimDA: Simple diffusion adapter for efficient video generation

Z Xing, Q Dai, H Hu, Z Wu… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
The recent wave of AI-generated content has seen the rapid development and success
of Text-to-Image (T2I) technologies. By contrast, Text-to-Video (T2V) still falls short of …

CORA: Adapting CLIP for open-vocabulary detection with region prompting and anchor pre-matching

X Wu, F Zhu, R Zhao, H Li - … of the IEEE/CVF conference on …, 2023 - openaccess.thecvf.com
Open-vocabulary detection (OVD) is an object detection task that aims to detect objects
from novel categories beyond the base categories on which the detector is trained. Recent …

Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization

J Abdul Samadh, MH Gani, N Hussein… - Advances in …, 2023 - proceedings.neurips.cc
The promising zero-shot generalization of vision-language models such as CLIP has led to
their adoption via prompt learning for numerous downstream tasks. Previous works have …

Towards large-scale 3D representation learning with multi-dataset point prompt training

X Wu, Z Tian, X Wen, B Peng, X Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
The rapid advancement of deep learning models is often attributed to their ability to leverage
massive training data. In contrast, such a privilege has not yet fully benefited 3D deep learning …