AdaptFormer: Adapting vision transformers for scalable visual recognition

S Chen, C Ge, Z Tong, J Wang… - Advances in …, 2022 - proceedings.neurips.cc
Pretraining Vision Transformers (ViTs) has achieved great success in visual
recognition. A following scenario is to adapt a ViT to various image and video recognition …

Hiera: A hierarchical vision transformer without the bells-and-whistles

C Ryali, YT Hu, D Bolya, C Wei, H Fan… - International …, 2023 - proceedings.mlr.press
Modern hierarchical vision transformers have added several vision-specific components in
the pursuit of supervised classification performance. While these components lead to …

ResT: An efficient transformer for visual recognition

Q Zhang, YB Yang - Advances in neural information …, 2021 - proceedings.neurips.cc
This paper presents an efficient multi-scale vision Transformer, called ResT, that capably
serves as a general-purpose backbone for image recognition. Unlike existing Transformer …

AdaViT: Adaptive vision transformers for efficient image recognition

L Meng, H Li, BC Chen, S Lan, Z Wu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Built on top of self-attention mechanisms, vision transformers have demonstrated
remarkable performance on a variety of vision tasks recently. While achieving excellent …

HiViT: A simpler and more efficient design of hierarchical vision transformer

X Zhang, Y Tian, L Xie, W Huang, Q Dai… - The Eleventh …, 2023 - openreview.net
There has been a debate on the choice of plain vs. hierarchical vision transformers, where
researchers often believe that the former (e.g., ViT) has a simpler design but the latter (e.g., …

VOLO: Vision outlooker for visual recognition

L Yuan, Q Hou, Z Jiang, J Feng… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Recently, Vision Transformers (ViTs) have been broadly explored in visual recognition. Owing to
low efficiency in encoding fine-level features, the performance of ViTs is still inferior to the …

Scalable vision transformers with hierarchical pooling

Z Pan, B Zhuang, J Liu, H He… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
The recently proposed Visual image Transformers (ViT) with pure attention have achieved
promising performance on image recognition tasks, such as image classification. However …

Not all patches are what you need: Expediting vision transformers via token reorganizations

Y Liang, C Ge, Z Tong, Y Song, J Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
Vision Transformers (ViTs) take all the image patches as tokens and construct multi-head
self-attention (MHSA) among them. Complete leverage of these image tokens brings …

Discrete representations strengthen vision transformer robustness

C Mao, L Jiang, M Dehghani, C Vondrick… - arXiv preprint arXiv …, 2021 - arxiv.org
Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image
recognition. While recent studies suggest that ViTs are more robust than their convolutional …

Visformer: The vision-friendly transformer

Z Chen, L Xie, J Niu, X Liu, L Wei… - Proceedings of the …, 2021 - openaccess.thecvf.com
The past year has witnessed the rapid development of applying the Transformer module to
vision problems. While some researchers have demonstrated that Transformer-based …