Categories of response-based, feature-based, and relation-based knowledge distillation

C Yang, X Yu, Z An, Y Xu - … Distillation: Towards New Horizons of Intelligent …, 2023 - Springer
Deep neural networks have achieved remarkable performance on artificial intelligence
tasks. The success behind intelligent systems often relies on large-scale models with high …
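For orientation, the three categories named in the title can be illustrated with generic loss terms: response-based methods match output logits, feature-based methods match intermediate representations, and relation-based methods match pairwise structure between samples. The sketch below is a minimal PyTorch illustration of these generic forms, not the survey's specific formulations; the temperature and the learned projection module are assumptions.

```python
import torch
import torch.nn.functional as F

def response_based_kd(student_logits, teacher_logits, T=4.0):
    # Response-based KD: match softened output distributions (Hinton-style KL).
    p_t = F.softmax(teacher_logits / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

def feature_based_kd(student_feat, teacher_feat, proj):
    # Feature-based KD: match intermediate features after a learned projection
    # (proj maps the student's feature dimension to the teacher's; assumed here).
    return F.mse_loss(proj(student_feat), teacher_feat)

def relation_based_kd(student_feat, teacher_feat):
    # Relation-based KD: match the pairwise similarity structure across the batch.
    def gram(x):
        x = F.normalize(x.flatten(1), dim=1)
        return x @ x.t()
    return F.mse_loss(gram(student_feat), gram(teacher_feat))
```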

A survey of the self supervised learning mechanisms for vision transformers

A Khan, A Sohail, M Fiaz, M Hassan, TH Afridi… - arXiv preprint arXiv …, 2024 - arxiv.org
Deep supervised learning models require a high volume of labeled data to attain sufficiently
good results. However, the practice of gathering and annotating such big data is costly and …

Maskedkd: Efficient distillation of vision transformers with masked images

S Son, N Lee, J Lee - arXiv preprint arXiv:2302.10494, 2023 - arxiv.org
Knowledge distillation is an effective method for training lightweight models, but it adds
significant computational overhead to training, as the method requires …
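The snippet is truncated, but the title suggests that the cost of the teacher's forward pass is reduced by feeding it masked (subsampled) patch tokens. The sketch below is a hedged illustration of that general idea, assuming a ViT-style interface in which student and teacher consume patch-token sequences; the random selection rule, keep_ratio, and function names are illustrative, not the paper's actual method or API.

```python
import torch

def distill_with_masked_teacher_input(student, teacher, patch_tokens, keep_ratio=0.5):
    """Illustrative only: run the teacher on a subset of patch tokens so its
    forward pass is cheaper. The selection rule in the actual paper may differ
    (e.g., guided by attention rather than random)."""
    B, N, D = patch_tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    # Random subset of token indices per sample (assumption).
    idx = torch.rand(B, N, device=patch_tokens.device).argsort(dim=1)[:, :n_keep]
    kept = torch.gather(patch_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    with torch.no_grad():
        teacher_logits = teacher(kept)        # cheaper forward on fewer tokens
    student_logits = student(patch_tokens)    # student still sees the full input
    return student_logits, teacher_logits
```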

The Role of Masking for Efficient Supervised Knowledge Distillation of Vision Transformers

S Son, J Ryu, N Lee, J Lee - European Conference on Computer Vision, 2025 - Springer
Knowledge distillation is an effective method for training lightweight vision models.
However, acquiring teacher supervision for training samples is often costly, especially from …

On the Surprising Effectiveness of Attention Transfer for Vision Transformers

AC Li, Y Tian, B Chen, D Pathak, X Chen - arXiv preprint arXiv:2411.09702, 2024 - arxiv.org
Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves
downstream performance by learning useful representations. Is this actually true? We …
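Attention transfer, in the general sense, trains a student to reproduce the teacher's attention maps rather than (or in addition to) its features or outputs. The sketch below shows one common form for ViTs, matching head-averaged attention matrices block by block with an MSE; the tensor shapes and the choice of objective are assumptions, not necessarily the loss used in the paper.

```python
import torch
import torch.nn.functional as F

def attention_transfer_loss(student_attn, teacher_attn):
    """student_attn, teacher_attn: lists of [B, heads, N, N] attention maps,
    one per transformer block, assumed already softmax-normalized and assumed
    to share the same token count N. Matches head-averaged maps with an MSE."""
    loss = 0.0
    for a_s, a_t in zip(student_attn, teacher_attn):
        loss = loss + F.mse_loss(a_s.mean(dim=1), a_t.mean(dim=1))
    return loss / len(student_attn)
```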

Prototype-guided Attention Distillation for Discriminative Person Search

H Kim, J Lee, K Sohn - IEEE Transactions on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Person search aims to localize a person of interest in a large image gallery captured by
multiple, non-overlapping cameras. Prevalent unified methods have suffered from (1) noisy …

Simple Unsupervised Knowledge Distillation With Space Similarity

A Singh, H Wang - European Conference on Computer Vision, 2025 - Springer
According to recent studies, self-supervised learning (SSL) does not readily extend to smaller
architectures. One direction to mitigate this shortcoming while simultaneously training a …

Knowledge Distillation in RNN-Attention Models for Early Prediction of Student Performance

S Leelaluk, C Tang, V Švábenský… - arXiv preprint arXiv …, 2024 - arxiv.org
Educational data mining (EDM) is a part of applied computing that focuses on automatically
analyzing data from learning contexts. Early prediction for identifying at-risk students is a …

KS-DETR: Knowledge Sharing in Attention Learning for Detection Transformer

K Zhao, N Ukita - arXiv preprint arXiv:2302.11208, 2023 - arxiv.org
Scaled dot-product attention applies a softmax function to the scaled dot-product of queries
and keys to calculate weights and then multiplies the weights by the values. In this work, we …
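The computation described in the snippet is standard scaled dot-product attention; a minimal sketch, assuming tensors of shape [batch, heads, length, head_dim], is:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: [batch, heads, seq_len, head_dim]
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # scaled dot-product of queries and keys
    weights = F.softmax(scores, dim=-1)           # softmax over keys gives the weights
    return weights @ v                            # weights multiplied by the values
```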

Exemplar-Free Continual Learning in Vision Transformers via Feature Attention Distillation

X Dai, J Cheng, Z Wei, B Du - 2023 IEEE International …, 2023 - ieeexplore.ieee.org
In this paper, we propose a new approach for continual learning based on Vision
Transformers (ViTs). The purpose of continual learning is to address the catastrophic …