Universal-KD: Attention-based output-grounded intermediate layer knowledge distillation

Y Wu, M Rezagholizadeh, A Ghaddar… - Proceedings of the …, 2021 - aclanthology.org
Intermediate layer matching has been shown to be an effective approach for improving knowledge
distillation (KD). However, this technique applies matching in the hidden spaces of two …
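
For context on the intermediate layer matching mentioned above, a minimal PyTorch sketch of hidden-state matching with a learned linear projection and an MSE loss; the hidden sizes and the one-to-one layer pairing are illustrative assumptions, not the paper's Universal-KD formulation:

```python
import torch
import torch.nn as nn

student_dim, teacher_dim = 312, 768          # assumed hidden sizes (illustrative)
proj = nn.Linear(student_dim, teacher_dim)   # learned projection into the teacher space

def layer_matching_loss(h_student, h_teacher):
    """MSE between projected student hidden states and teacher hidden states.
    h_student: (batch, seq_len, student_dim); h_teacher: (batch, seq_len, teacher_dim)."""
    return nn.functional.mse_loss(proj(h_student), h_teacher)

# toy usage with random tensors standing in for real hidden states
h_s, h_t = torch.randn(8, 128, student_dim), torch.randn(8, 128, teacher_dim)
loss = layer_matching_loss(h_s, h_t)
```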

One-for-all: Bridge the gap between heterogeneous architectures in knowledge distillation

Z Hao, J Guo, K Han, Y Tang, H Hu… - Advances in Neural …, 2024 - proceedings.neurips.cc
Knowledge distillation (KD) has proven to be a highly effective approach for
enhancing model performance through a teacher-student training scheme. However, most …

Automatic student network search for knowledge distillation

Z Zhang, W Zhu, J Yan, P Gao… - 2020 25th International …, 2021 - ieeexplore.ieee.org
Pre-trained language models (PLMs), such as BERT, have achieved outstanding
performance on multiple natural language processing (NLP) tasks. However, such pre …

Soft Hybrid Knowledge Distillation against deep neural networks

J Zhang, Z Tao, S Zhang, Z Qiao, K Guo - Neurocomputing, 2024 - Elsevier
Traditional knowledge distillation approaches are typically designed for specific tasks, as
they primarily distill deep features from intermediate layers of a neural network, generally …

Multi-level knowledge distillation via dynamic decision boundaries exploration and exploitation

Z Tao, H Li, J Zhang, S Zhang - Information Fusion, 2024 - Elsevier
Existing knowledge distillation methods directly transfer knowledge from different
intermediate layers of the teacher model without differentiating their correctness. However …

ALP-KD: Attention-based layer projection for knowledge distillation

P Passban, Y Wu, M Rezagholizadeh… - Proceedings of the AAAI …, 2021 - ojs.aaai.org
Knowledge distillation is considered a training and compression strategy in
which two neural networks, namely a teacher and a student, are coupled together during …
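
A rough sketch of the attention-based layer projection idea: each student layer attends over all teacher layers and is matched against the resulting weighted mixture, so no fixed layer-to-layer mapping is needed. The pooled per-layer representations, dot-product similarity, and equal hidden sizes below are simplifying assumptions rather than the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def attention_layer_projection_loss(student_layers, teacher_layers):
    """student_layers / teacher_layers: lists of pooled (batch, dim) states,
    one entry per layer; hidden sizes assumed equal for simplicity."""
    T = torch.stack(teacher_layers, dim=1)              # (batch, n_teacher, dim)
    loss = 0.0
    for h_s in student_layers:                          # (batch, dim)
        scores = torch.einsum('bd,bnd->bn', h_s, T)     # similarity to every teacher layer
        alpha = F.softmax(scores, dim=-1)               # attention weights over teacher layers
        target = torch.einsum('bn,bnd->bd', alpha, T)   # attention-weighted teacher mixture
        loss = loss + F.mse_loss(h_s, target)
    return loss / len(student_layers)

# toy usage: 4 student layers attend over 12 teacher layers
students = [torch.randn(8, 256) for _ in range(4)]
teachers = [torch.randn(8, 256) for _ in range(12)]
loss = attention_layer_projection_loss(students, teachers)
```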

RW-KD: Sample-wise loss terms re-weighting for knowledge distillation

P Lu, A Ghaddar, A Rashid… - Findings of the …, 2021 - aclanthology.org
Knowledge Distillation (KD) is extensively used in Natural Language Processing to
compress the pre-training and task-specific fine-tuning phases of large neural language …
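
A minimal sketch of sample-wise re-weighting between the cross-entropy and distillation terms, which is the loss structure the title refers to; the per-sample weights are taken as given here, whereas RW-KD estimates them, so this is an illustration of the weighting only:

```python
import torch
import torch.nn.functional as F

def sample_reweighted_loss(student_logits, teacher_logits, labels, sample_weights, T=2.0):
    """sample_weights: (batch,) values in [0, 1]; assumed given here."""
    ce = F.cross_entropy(student_logits, labels, reduction='none')        # (batch,)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction='none').sum(dim=-1) * (T * T)                 # (batch,)
    return (sample_weights * kd + (1.0 - sample_weights) * ce).mean()

# toy usage
s, t = torch.randn(8, 5), torch.randn(8, 5)
y, w = torch.randint(0, 5, (8,)), torch.rand(8)
loss = sample_reweighted_loss(s, t, y, w)
```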

Hybrid mix-up contrastive knowledge distillation

J Zhang, Z Tao, K Guo, H Li, S Zhang - Information Sciences, 2024 - Elsevier
Abstract Knowledge distillation (KD) aims to build a lightweight deep neural network model
under the guidance of a large-scale teacher model for model simplicity. Despite improved …

Knowledge distillation from a stronger teacher

T Huang, S You, F Wang, C Qian… - Advances in Neural …, 2022 - proceedings.neurips.cc
Unlike existing knowledge distillation methods that focus on baseline settings, where the
teacher models and training strategies are not as strong and competitive as state-of-the-art …
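
A hedged sketch of matching prediction relations via Pearson correlation rather than exact probabilities, which is the general idea behind distilling from a much stronger teacher; the temperature and the equal weighting of the inter- and intra-class terms are assumptions, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def pearson_distance(a, b, dim=-1, eps=1e-8):
    a = a - a.mean(dim=dim, keepdim=True)
    b = b - b.mean(dim=dim, keepdim=True)
    corr = (a * b).sum(dim=dim) / (a.norm(dim=dim) * b.norm(dim=dim) + eps)
    return 1.0 - corr

def relation_matching_loss(student_logits, teacher_logits, T=4.0):
    p_s = F.softmax(student_logits / T, dim=-1)
    p_t = F.softmax(teacher_logits / T, dim=-1)
    inter = pearson_distance(p_s, p_t, dim=-1).mean()   # relations among classes, per sample
    intra = pearson_distance(p_s, p_t, dim=0).mean()    # relations among samples, per class
    return inter + intra

# toy usage
loss = relation_matching_loss(torch.randn(16, 10), torch.randn(16, 10))
```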

An embarrassingly simple approach for knowledge distillation

M Gao, Y Shen, Q Li, J Yan, L Wan, D Lin… - arXiv preprint arXiv …, 2018 - arxiv.org
Knowledge Distillation (KD) aims at improving the performance of a low-capacity student
model by inheriting knowledge from a high-capacity teacher model. Previous KD methods …
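
For background on the generic KD objective this abstract refers to, the standard softened-logits formulation of Hinton et al. is sketched below; this is context only, not the simpler scheme the paper itself proposes, and the temperature and mixing weight are conventional defaults rather than values from the paper:

```python
import torch
import torch.nn.functional as F

def hinton_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction='batchmean') * (T * T)    # softened-logits term
    hard = F.cross_entropy(student_logits, labels)      # ground-truth term
    return alpha * soft + (1.0 - alpha) * hard

# toy usage
loss = hinton_kd_loss(torch.randn(8, 10), torch.randn(8, 10), torch.randint(0, 10, (8,)))
```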