Feature dimensionality reduction: a review

W Jia, M Sun, J Lian, S Hou - Complex & Intelligent Systems, 2022 - Springer
As a topic of basic research, the "curse of dimensionality" has received increasing attention because it raises the cost of data storage and computation; it also …

Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer
With the urgent demand for generalized deep models, many large pre-trained models have been proposed, such as bidirectional encoder representations from transformers (BERT), vision transformer (ViT) …

InternImage: Exploring large-scale vision foundation models with deformable convolutions

W Wang, J Dai, Z Chen, Z Huang, Z Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
Compared to the great progress of large-scale vision transformers (ViTs) in recent years,
large-scale models based on convolutional neural networks (CNNs) are still in an early …

Vision Mamba: Efficient visual representation learning with bidirectional state space model

L Zhu, B Liao, Q Zhang, X Wang, W Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, state space models (SSMs) with efficient hardware-aware designs, i.e., the Mamba deep learning model, have shown great potential for long sequence modeling …

Vision transformer adapter for dense predictions

Z Chen, Y Duan, W Wang, J He, T Lu, J Dai… - arXiv preprint arXiv …, 2022 - arxiv.org
This work investigates a simple yet powerful adapter for Vision Transformer (ViT). Unlike
recent visual transformers that introduce vision-specific inductive biases into their …

MaxViT: Multi-axis vision transformer

Z Tu, H Talebi, H Zhang, F Yang, P Milanfar… - European conference on …, 2022 - Springer
Transformers have recently gained significant attention in the computer vision community.
However, the lack of scalability of self-attention mechanisms with respect to image size has …

Visual attention network

MH Guo, CZ Lu, ZN Liu, MM Cheng, SM Hu - Computational Visual Media, 2023 - Springer
While originally designed for natural language processing tasks, the self-attention
mechanism has recently taken various computer vision areas by storm. However, the 2D …

Stratified transformer for 3D point cloud segmentation

X Lai, J Liu, L Jiang, L Wang, H Zhao… - Proceedings of the …, 2022 - openaccess.thecvf.com
3D point cloud segmentation has made tremendous progress in recent years. Most current methods focus on aggregating local features, but fail to directly model long-range …

Grounded language-image pre-training

LH Li, P Zhang, H Zhang, J Yang, C Li… - Proceedings of the …, 2022 - openaccess.thecvf.com
This paper presents a grounded language-image pre-training (GLIP) model for learning
object-level, language-aware, and semantic-rich visual representations. GLIP unifies object …

Vision transformer with deformable attention

Z Xia, X Pan, S Song, LE Li… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Transformers have recently shown superior performance on various vision tasks. The large, sometimes even global, receptive field endows Transformer models with higher …