Merging Vision Transformers from Different Tasks and Domains

P Ye, C Huang, M Shen, T Chen, Y Huang… - arXiv preprint arXiv …, 2023 - arxiv.org
This work aims to merge various Vision Transformers (ViTs) trained on different tasks (i.e.,
datasets with different object categories) or domains (i.e., datasets with the same categories …

BViT: Broad attention-based vision transformer

N Li, Y Chen, W Li, Z Ding, D Zhao… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Recent works have demonstrated that transformers can achieve promising performance in
computer vision by exploiting the relationships among image patches with self-attention …

DeepViT: Towards deeper vision transformer

D Zhou, B Kang, X Jin, L Yang, X Lian, Z Jiang… - arXiv preprint arXiv …, 2021 - arxiv.org
Vision transformers (ViTs) have been successfully applied in image classification tasks
recently. In this paper, we show that, unlike convolutional neural networks (CNNs) that can be …

Enhancing performance of vision transformers on small datasets through local inductive bias incorporation

IB Akkaya, SS Kathiresan, E Arani, B Zonooz - Pattern Recognition, 2024 - Elsevier
Vision transformers (ViTs) achieve remarkable performance on large datasets, but tend to
perform worse than convolutional neural networks (CNNs) when trained from scratch on …

SkipViT: Speeding Up Vision Transformers with a Token-Level Skip Connection

F Ataiefard, W Ahmed, H Hajimolahoseini… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision transformers are known to be more computationally and data-intensive than CNN
models. These transformer models, such as ViT, require all the input image tokens to learn …

A unified pruning framework for vision transformers

H Yu, J Wu - Science China Information Sciences, 2023 - Springer
In this study, we proposed a novel method called UP-ViTs to prune ViTs in a
unified manner. Our framework can prune all components in a ViT and its variants, maintain …

OAMixer: Object-aware mixing layer for vision transformers

H Kang, S Mo, J Shin - arXiv preprint arXiv:2212.06595, 2022 - arxiv.org
Patch-based models, e.g., Vision Transformers (ViTs) and Mixers, have shown impressive
results on various visual recognition tasks as alternatives to classic convolutional networks. While …

The principle of diversity: Training stronger vision transformers calls for reducing all levels of redundancy

T Chen, Z Zhang, Y Cheng… - Proceedings of the …, 2022 - openaccess.thecvf.com
Vision transformers (ViTs) have gained increasing popularity as they are commonly believed
to have higher modeling capacity and representation flexibility than traditional convolutional …

Holistically explainable vision transformers

M Böhle, M Fritz, B Schiele - arXiv preprint arXiv:2301.08669, 2023 - arxiv.org
Transformers increasingly dominate the machine learning landscape across many tasks and
domains, which increases the importance of understanding their outputs. While their …

Not all patches are what you need: Expediting vision transformers via token reorganizations

Y Liang, C Ge, Z Tong, Y Song, J Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
Vision Transformers (ViTs) take all the image patches as tokens and construct multi-head
self-attention (MHSA) among them. Fully leveraging these image tokens brings …
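
Several of the entries above (token skipping, token reorganization, pruning) start from the same basic computation this last snippet describes: every image patch becomes a token and MHSA is computed over the full token sequence. The following is a minimal PyTorch sketch of that mechanism only; it is not taken from any of the listed papers, and the patch size, embedding dimension, and head count are illustrative assumptions.

```python
# Minimal sketch: patch tokenization + multi-head self-attention over all tokens.
# Hyperparameters below are illustrative, not from any cited paper.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)            # (batch, channels, height, width)
patch_size, embed_dim, num_heads = 16, 192, 3  # hypothetical small-ViT settings

# Patch embedding: a strided convolution projects each 16x16 patch to one token.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
tokens = patch_embed(image).flatten(2).transpose(1, 2)  # (1, 196, 192): 14x14 patches

# MHSA over all patch tokens; every token attends to every other token,
# which is the quadratic cost that token-pruning/reorganization methods reduce.
mhsa = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
out, attn = mhsa(tokens, tokens, tokens)
print(out.shape, attn.shape)  # torch.Size([1, 196, 192]), torch.Size([1, 196, 196])
```

The attention map returned here (one weight per token pair) is also the signal that token-reorganization methods typically inspect when deciding which patches to keep.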