Comparing Kullback-Leibler divergence and mean squared error loss in knowledge distillation

T Kim, J Oh, NY Kim, S Cho, SY Yun - arXiv preprint arXiv:2105.08919, 2021 - arxiv.org
Knowledge distillation (KD), transferring knowledge from a cumbersome teacher model to a
lightweight student model, has been investigated to design efficient neural architectures …
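
The comparison in this paper is concrete enough to sketch. Below is a minimal illustration, not the authors' code, of the two objectives being contrasted: KL divergence between temperature-softened distributions versus mean squared error applied directly to the logits. The temperature value and function names are my own choices.

```python
# Minimal sketch (not the authors' implementation) of the two KD objectives.
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between temperature-softened distributions (Hinton-style KD)."""
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    p_t = F.softmax(teacher_logits / T, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)

def kd_mse_loss(student_logits, teacher_logits):
    """Mean squared error applied directly to the logits (logit matching)."""
    return F.mse_loss(student_logits, teacher_logits)

# Usage: either term would normally be blended with cross-entropy on hard labels.
s = torch.randn(8, 100)   # student logits (batch of 8, 100 classes)
t = torch.randn(8, 100)   # teacher logits
print(kd_kl_loss(s, t).item(), kd_mse_loss(s, t).item())
```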

What makes a "good" data augmentation in knowledge distillation - a statistical perspective

H Wang, S Lohit, MN Jones… - Advances in Neural …, 2022 - proceedings.neurips.cc
Knowledge distillation (KD) is a general neural network training approach that uses
a teacher model to guide the student model. Existing works mainly study KD from the …

Rethinking soft labels for knowledge distillation: A bias-variance tradeoff perspective

H Zhou, L Song, J Chen, Y Zhou, G Wang… - arXiv preprint arXiv …, 2021 - arxiv.org
Knowledge distillation is an effective approach to leverage a well-trained network or an
ensemble of them, referred to as the teacher, to guide the training of a student network. The …
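
Because the bias-variance argument concerns the soft-label term, it helps to recall the standard form in which soft labels enter the student objective (my notation; the paper's exact formulation may differ):

```latex
% z_s, z_t : student / teacher logits,  \sigma : softmax,  T : temperature
\[
\mathcal{L}_{\mathrm{KD}}
  = (1-\alpha)\,\mathrm{CE}\!\big(y,\ \sigma(z_s)\big)
  + \alpha\, T^{2}\,
    \mathrm{KL}\!\big(\sigma(z_t/T)\,\big\|\,\sigma(z_s/T)\big)
\]
```

Here α weights the teacher's soft distribution against the hard label y, and T controls how much the soft labels are smoothed.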

Preparing lessons: Improve knowledge distillation with better supervision

T Wen, S Lai, X Qian - Neurocomputing, 2021 - Elsevier
Knowledge distillation (KD) is widely applied in the training of efficient neural
networks. A compact model, which is trained to mimic the representation of a cumbersome …

Asymmetric temperature scaling makes larger networks teach well again

XC Li, WS Fan, S Song, Y Li… - Advances in neural …, 2022 - proceedings.neurips.cc
Knowledge Distillation (KD) aims at transferring the knowledge of a well-performed
neural network (the teacher) to a weaker one (the student). A peculiar phenomenon …
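
As a rough reading of the abstract, the "asymmetric" part can be sketched as softening the teacher's target-class logit and its non-target logits with two different temperatures before normalizing. The snippet below is an illustrative guess at that mechanism, not the authors' implementation; the function name and temperature values are assumptions.

```python
# Hedged sketch: apply one temperature to the ground-truth class logit and a
# different (larger) one to all other logits, so a very confident large teacher
# still yields informative wrong-class probabilities.
import torch
import torch.nn.functional as F

def asymmetric_softmax(logits, labels, t_target=1.0, t_other=4.0):
    """Scale the target-class logit by t_target and the rest by t_other,
    then renormalize with a softmax."""
    scaled = logits / t_other
    idx = torch.arange(logits.size(0))
    scaled[idx, labels] = logits[idx, labels] / t_target
    return F.softmax(scaled, dim=-1)

teacher_logits = torch.tensor([[8.0, 1.0, 0.5, -0.5]])
labels = torch.tensor([0])
print(asymmetric_softmax(teacher_logits, labels))
```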

Teach less, learn more: On the undistillable classes in knowledge distillation

Y Zhu, N Liu, Z Xu, X Liu, W Meng… - Advances in …, 2022 - proceedings.neurips.cc
Knowledge distillation (KD) can effectively compress neural networks by training a
smaller network (student) to simulate the behavior of a larger one (teacher). A counter …

Understanding and improving knowledge distillation

J Tang, R Shivanna, Z Zhao, D Lin, A Singh… - arXiv preprint arXiv …, 2020 - arxiv.org
Knowledge Distillation (KD) is a model-agnostic technique to improve model quality while
having a fixed capacity budget. It is a commonly used technique for model compression …

One-for-all: Bridge the gap between heterogeneous architectures in knowledge distillation

Z Hao, J Guo, K Han, Y Tang, H Hu… - Advances in Neural …, 2024 - proceedings.neurips.cc
Knowledge distillation (KD) has proven to be a highly effective approach for
enhancing model performance through a teacher-student training scheme. However, most …

Knowledge distillation beyond model compression

F Sarfraz, E Arani, B Zonooz - 2020 25th International …, 2021 - ieeexplore.ieee.org
Knowledge distillation (KD) is commonly regarded as an effective model compression
technique in which a compact model (student) is trained under the supervision of a larger …

Logit standardization in knowledge distillation

S Sun, W Ren, J Li, R Wang… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Knowledge distillation involves transferring soft labels from a teacher to a student
using a shared temperature-based softmax function. However, the assumption of a shared …
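
Taking the abstract at face value, the alternative to a shared temperature can be sketched as z-score standardizing each sample's logits before the softmax, so that student and teacher are matched on the shape of their logit distributions rather than their absolute scale. The snippet below is a hedged sketch under that assumption, not the authors' code; the base temperature and function names are my own.

```python
# Hedged sketch: per-sample z-score standardization of logits before the
# temperature softmax, then the usual KL-based distillation term.
import torch
import torch.nn.functional as F

def standardized_log_probs(logits, T=2.0, eps=1e-7):
    mu = logits.mean(dim=-1, keepdim=True)
    sigma = logits.std(dim=-1, keepdim=True)
    z = (logits - mu) / (sigma + eps)      # per-sample z-score of the logits
    return F.log_softmax(z / T, dim=-1)

s_logits = torch.randn(4, 10) * 5.0   # student logits at a larger scale
t_logits = torch.randn(4, 10)         # teacher logits
kd = F.kl_div(standardized_log_probs(s_logits),
              standardized_log_probs(t_logits).exp(),
              reduction="batchmean")
print(kd.item())
```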