Training data-efficient image transformers & distillation through attention

H Touvron, M Cord, M Douze, F Massa… - International …, 2021 - proceedings.mlr.press
Recently, neural networks purely based on attention were shown to address image
understanding tasks such as image classification. These high-performing vision …

Going deeper with image transformers

H Touvron, M Cord, A Sablayrolles… - Proceedings of the …, 2021 - openaccess.thecvf.com
Transformers have been recently adapted for large-scale image classification, achieving
high scores and shaking up the long supremacy of convolutional neural networks. However, the …

CMT: Convolutional neural networks meet vision transformers

J Guo, K Han, H Wu, Y Tang, X Chen… - Proceedings of the …, 2022 - openaccess.thecvf.com
Vision transformers have been successfully applied to image recognition tasks due to their
ability to capture long-range dependencies within an image. However, there are still gaps in …

AdaViT: Adaptive vision transformers for efficient image recognition

L Meng, H Li, BC Chen, S Lan, Z Wu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Built on top of self-attention mechanisms, vision transformers have demonstrated
remarkable performance on a variety of vision tasks recently. While achieving excellent …

CrossViT: Cross-attention multi-scale vision transformer for image classification

CFR Chen, Q Fan, R Panda - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com
The recently developed vision transformer (ViT) has achieved promising results on image
classification compared to convolutional neural networks. Inspired by this, in this paper, we …

Token labeling: Training a 85.5% top-1 accuracy vision transformer with 56M parameters on ImageNet

Z Jiang, Q Hou, L Yuan, D Zhou, X Jin… - arXiv preprint arXiv …, 2021 - academia.edu
This paper provides a strong baseline for vision transformers on the ImageNet classification
task. While recent vision transformers have demonstrated promising results in ImageNet …

Incorporating convolution designs into visual transformers

K Yuan, S Guo, Z Liu, A Zhou… - Proceedings of the …, 2021 - openaccess.thecvf.com
Motivated by the success of Transformers in natural language processing (NLP) tasks, there
exist some attempts (e.g., ViT and DeiT) to apply Transformers to the vision domain. However …

An image is worth 16x16 words: Transformers for image recognition at scale

A Dosovitskiy, L Beyer, A Kolesnikov… - arXiv preprint arXiv …, 2020 - arxiv.org
While the Transformer architecture has become the de-facto standard for natural language
processing tasks, its applications to computer vision remain limited. In vision, attention is …
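The "16x16 words" in the title above refers to ViT's core preprocessing step: an image is cut into non-overlapping 16x16 patches, each flattened into a vector that is treated as one token. A minimal sketch of that patchification (illustrative values only; the 224x224 resolution and patch size 16 are the common ViT defaults, not taken from the snippet):

```python
import numpy as np

# Hypothetical example: split a 224x224 RGB image into non-overlapping
# 16x16 patches, as in the "16x16 words" framing of the ViT title.
H = W = 224          # input resolution (common ViT default)
P = 16               # patch size
C = 3                # RGB channels

image = np.zeros((H, W, C), dtype=np.float32)  # dummy input

# Reshape into a grid of patches, then flatten each patch into a
# vector ("token"): (H/P * W/P) tokens of dimension P*P*C.
patches = image.reshape(H // P, P, W // P, P, C)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)

print(patches.shape)  # (196, 768): 196 tokens, each a 768-dim "word"
```

In the full model, each 768-dimensional patch vector is then linearly projected to the Transformer's embedding dimension before self-attention is applied.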

Refiner: Refining self-attention for vision transformers

D Zhou, Y Shi, B Kang, W Yu, Z Jiang, Y Li… - arXiv preprint arXiv …, 2021 - arxiv.org
Vision Transformers (ViTs) have shown competitive accuracy in image classification tasks
compared with CNNs. Yet, they generally require much more data for model pre-training …

Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet

L Yuan, Y Chen, T Wang, W Yu, Y Shi… - Proceedings of the …, 2021 - openaccess.thecvf.com
Transformers, which are popular for language modeling, have recently been explored for solving
vision tasks, e.g., the Vision Transformer (ViT) for image classification. The ViT model …