K Jiang, P Peng, Y Lian, W Xu - Journal of Visual Communication and …, 2022 - Elsevier
Abstract: In contrast to Convolutional Neural Networks (CNNs), Vision Transformers (ViTs)
cannot capture the sequence ordering of input tokens and require position embeddings. As a …