Review of large vision models and visual prompt engineering

J Wang, Z Liu, L Zhao, Z Wu, C Ma, S Yu, H Dai… - Meta-Radiology, 2023 - Elsevier
Visual prompt engineering is a fundamental methodology in the field of visual and image
artificial general intelligence. As the development of large vision models progresses, the …
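
As a rough illustration of what "visual prompting" covers, the sketch below implements one common variant (VPT-style prompt tuning): learnable prompt tokens are prepended to a frozen ViT's patch sequence, and only the prompts are trained. The `backbone`, `embed_dim`, and `num_prompts` names are assumptions for this example, not an interface from the survey.

```python
import torch
import torch.nn as nn

class VisualPromptedViT(nn.Module):
    """Minimal prompt-tuning sketch: prepend learnable prompt tokens
    to the patch sequence of a frozen ViT encoder."""

    def __init__(self, backbone: nn.Module, embed_dim: int = 768, num_prompts: int = 8):
        super().__init__()
        self.backbone = backbone              # assumed frozen ViT encoder
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, embed_dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, D) embeddings from the frozen patch-embed layer
        B = patch_tokens.size(0)
        x = torch.cat([self.prompts.expand(B, -1, -1), patch_tokens], dim=1)
        return self.backbone(x)               # (B, num_prompts + N, D)
```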

MM-LLMs: Recent advances in multimodal large language models

D Zhang, Y Yu, C Li, J Dong, D Su, C Chu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …
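
A recurring recipe behind these systems is to freeze both the LLM and the vision encoder and train only a small connector that maps image features into the LLM's token-embedding space, so images enter as soft tokens. A minimal sketch with assumed dimensions (`vision_dim`, `llm_dim`), not code from the survey:

```python
import torch
import torch.nn as nn

class VisionToLLMConnector(nn.Module):
    """Sketch of an MM-LLM connector: project frozen vision features
    into the LLM embedding space so they can be fed as soft tokens."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (B, N, vision_dim) from a frozen image encoder
        return self.proj(vision_feats)        # (B, N, llm_dim) visual tokens
```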

Vision transformer adapter for dense predictions

Z Chen, Y Duan, W Wang, J He, T Lu, J Dai… - arXiv preprint arXiv …, 2022 - arxiv.org
This work investigates a simple yet powerful adapter for Vision Transformer (ViT). Unlike
recent visual transformers that introduce vision-specific inductive biases into their …
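
For context, a generic residual bottleneck adapter, the building block this family of methods starts from, looks like the sketch below. ViT-Adapter's actual design additionally injects spatial priors via cross-attention for dense prediction, which this minimal version omits.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic residual bottleneck adapter inserted into frozen
    transformer blocks; zero-initialized so it starts as an identity."""

    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))
```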

Medical SAM adapter: Adapting Segment Anything Model for medical image segmentation

J Wu, W Ji, Y Liu, H Fu, M Xu, Y Xu, Y Jin - arXiv preprint arXiv:2304.12620, 2023 - arxiv.org
The Segment Anything Model (SAM) has recently gained popularity in the field of image
segmentation due to its impressive capabilities in various segmentation tasks and its prompt …
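
The training setup this line of work relies on is to keep the pretrained SAM weights frozen and update only the inserted adapter modules. A minimal sketch, assuming (our convention, not the paper's) that adapter parameters carry an "adapter" substring in their names:

```python
import torch.nn as nn

def freeze_except_adapters(model: nn.Module) -> None:
    """Freeze every parameter except those belonging to adapter modules,
    identified here by an 'adapter' substring in the parameter name."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name
```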

Visual prompt multi-modal tracking

J Zhu, S Lai, X Chen, D Wang… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Visible-modal object tracking gives rise to a series of downstream multi-modal tracking
tributaries. To inherit the powerful representations of the foundation model, a natural modus …

Scaling & shifting your features: A new baseline for efficient model tuning

D Lian, D Zhou, J Feng, X Wang - Advances in Neural …, 2022 - proceedings.neurips.cc
Existing fine-tuning methods either tune all parameters of the pre-trained model (full fine-
tuning), which is not efficient, or only tune the last linear layer (linear probing), which suffers …
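
The SSF idea itself is compact: after each operation of the frozen network, apply a learnable per-channel scale and shift, y = γ ⊙ x + β, and train only γ and β. A minimal sketch:

```python
import torch
import torch.nn as nn

class SSF(nn.Module):
    """Scaling & Shifting Features: a learnable per-channel affine
    transform applied after a frozen operation's output; only the
    scale (gamma) and shift (beta) vectors are trained."""

    def __init__(self, dim: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., dim); broadcast the affine over the channel dimension
        return x * self.gamma + self.beta
```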

Deep class-incremental learning: A survey

DW Zhou, QW Wang, ZH Qi, HJ Ye, DC Zhan… - arXiv preprint arXiv …, 2023 - arxiv.org
Deep models, e.g., CNNs and Vision Transformers, have achieved impressive results in
many vision tasks in the closed world. However, novel classes emerge from time to time in …
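
One concrete ingredient common to the incremental pipelines such surveys cover is growing the classifier head when new classes arrive while keeping old-class weights intact; a minimal sketch (the helper name is ours):

```python
import torch
import torch.nn as nn

def expand_classifier(head: nn.Linear, num_new: int) -> nn.Linear:
    """Grow a linear classifier for newly arriving classes, copying
    the existing class weights so old-class predictions are preserved."""
    new_head = nn.Linear(head.in_features, head.out_features + num_new)
    with torch.no_grad():
        new_head.weight[: head.out_features] = head.weight
        new_head.bias[: head.out_features] = head.bias
    return new_head
```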

SimDA: Simple diffusion adapter for efficient video generation

Z Xing, Q Dai, H Hu, Z Wu… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
The recent wave of AI-generated content has seen the rapid development and success of
Text-to-Image (T2I) technologies. By contrast, Text-to-Video (T2V) still falls short of …

Explicit visual prompting for low-level structure segmentations

W Liu, X Shen, CM Pun, X Cun - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
We consider the generic problem of detecting low-level structures in images, which includes
segmenting the manipulated parts, identifying out-of-focus pixels, separating shadow …
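
Methods in this vein derive the prompt from the image itself, for instance from its high-frequency components. A sketch of extracting such a component by masking low frequencies in the FFT (the `mask_ratio` knob is our assumption):

```python
import torch

def high_frequency_component(img: torch.Tensor, mask_ratio: float = 0.25) -> torch.Tensor:
    """Zero out a centered low-frequency block in the 2-D FFT and invert,
    leaving the high-frequency content often used as an explicit prompt."""
    _, _, H, W = img.shape                    # img: (B, C, H, W)
    freq = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    mh, mw = int(H * mask_ratio), int(W * mask_ratio)
    cy, cx = H // 2, W // 2
    freq[..., cy - mh // 2 : cy + mh // 2, cx - mw // 2 : cx + mw // 2] = 0
    return torch.fft.ifft2(torch.fft.ifftshift(freq, dim=(-2, -1))).real
```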

Cheap and quick: Efficient vision-language instruction tuning for large language models

G Luo, Y Zhou, T Ren, S Chen… - Advances in Neural …, 2024 - proceedings.neurips.cc
Recently, growing interest has arisen in extending the multimodal capability of large
language models (LLMs), e.g., vision-language (VL) learning, which is regarded as the next …