Automatic student network search for knowledge distillation

Z Zhang, W Zhu, J Yan, P Gao… - 2020 25th International …, 2021 - ieeexplore.ieee.org
Pre-trained language models (PLMs), such as BERT, have achieved outstanding
performance on multiple natural language processing (NLP) tasks. However, such pre …

Universal-KD: Attention-based output-grounded intermediate layer knowledge distillation

Y Wu, M Rezagholizadeh, A Ghaddar… - Proceedings of the …, 2021 - aclanthology.org
Intermediate layer matching has been shown to be an effective approach for improving knowledge
distillation (KD). However, this technique applies matching in the hidden spaces of two …
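
As background for entries of this kind, a minimal sketch of generic intermediate-layer matching is given below. It is not the attention-based, output-grounded scheme of Universal-KD; the hidden sizes, layer mapping, and projection layers are illustrative assumptions.

import torch.nn as nn

class IntermediateLayerKD(nn.Module):
    """Generic intermediate-layer matching: project selected student hidden
    states to the teacher's hidden size and penalize the MSE against the
    corresponding teacher layers (the layer mapping is an assumption)."""

    def __init__(self, student_dim=384, teacher_dim=768,
                 layer_map=((1, 3), (2, 6), (3, 9))):
        super().__init__()
        self.layer_map = layer_map  # (student_layer, teacher_layer) pairs
        self.proj = nn.ModuleDict({str(s): nn.Linear(student_dim, teacher_dim)
                                   for s, _ in layer_map})
        self.mse = nn.MSELoss()

    def forward(self, student_hidden, teacher_hidden):
        # Both arguments: lists of [batch, seq_len, dim] tensors, one per layer
        # (e.g. the hidden_states returned by a Hugging Face BERT model).
        loss = 0.0
        for s, t in self.layer_map:
            loss = loss + self.mse(self.proj[str(s)](student_hidden[s]),
                                   teacher_hidden[t].detach())
        return loss / len(self.layer_map)

In practice this term is added to the task loss and a logit-level distillation loss with tuned weights.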

SDSK2BERT: Explore the specific depth with specific knowledge to compress BERT

L Ding, Y Yang - 2020 IEEE International Conference on …, 2020 - ieeexplore.ieee.org
The success of pre-trained models like BERT in Natural Language Processing (NLP) has created a
demand for model compression. Previous works adopting knowledge distillation …

Reinforced multi-teacher selection for knowledge distillation

F Yuan, L Shou, J Pei, W Lin, M Gong, Y Fu… - Proceedings of the AAAI …, 2021 - ojs.aaai.org
In natural language processing (NLP) tasks, slow inference speed and huge GPU memory footprints
remain the bottleneck for applying pre-trained deep models in production. As a …
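
The paper itself learns a reinforcement-learning policy for picking teachers; purely as an illustration of per-sample teacher selection, the greedy stand-in below picks, for each example, the teacher whose logits give the lowest cross-entropy on that example (the selection criterion and tensor shapes are assumptions, not the method of the cited work).

import torch
import torch.nn.functional as F

def select_teacher_logits(teacher_logits_list, labels):
    # teacher_logits_list: list of [batch, num_classes] tensors, one per teacher.
    # Returns, per sample, the logits of the teacher with the lowest CE on that
    # sample (a greedy heuristic stand-in for the RL selection policy).
    losses = torch.stack([F.cross_entropy(t, labels, reduction="none")
                          for t in teacher_logits_list], dim=0)   # [n_teachers, batch]
    best = losses.argmin(dim=0)                                   # [batch]
    stacked = torch.stack(teacher_logits_list, dim=0)             # [n_teachers, batch, C]
    batch_idx = torch.arange(labels.size(0), device=labels.device)
    return stacked[best, batch_idx]                               # [batch, C]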

Revisiting intermediate layer distillation for compressing language models: An overfitting perspective

J Ko, S Park, M Jeong, S Hong, E Ahn… - arXiv preprint arXiv …, 2023 - arxiv.org
Knowledge distillation (KD) is a highly promising method for mitigating the computational
problems of pre-trained language models (PLMs). Among various KD approaches …

Which student is best? A comprehensive knowledge distillation exam for task-specific BERT models

MN Nityasya, HA Wibowo, R Chevi, RE Prasojo… - arXiv preprint arXiv …, 2022 - arxiv.org
We perform a knowledge distillation (KD) benchmark from task-specific BERT-base teacher
models to various student models: BiLSTM, CNN, BERT-Tiny, BERT-Mini, and BERT-Small …

RW-KD: Sample-wise loss terms re-weighting for knowledge distillation

P Lu, A Ghaddar, A Rashid… - Findings of the …, 2021 - aclanthology.org
Knowledge Distillation (KD) is extensively used in Natural Language Processing to
compress the pre-training and task-specific fine-tuning phases of large neural language …
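
The mechanism named in the title can be pictured as a per-sample weighted combination of the cross-entropy and distillation terms; the sketch below assumes the weights are given (in RW-KD they are produced by a learned re-weighting procedure, which is omitted here).

import torch.nn.functional as F

def sample_weighted_kd_loss(student_logits, teacher_logits, labels, weights, T=2.0):
    # weights: [batch] tensor in [0, 1]; how it is produced is outside this sketch.
    ce = F.cross_entropy(student_logits, labels, reduction="none")         # [batch]
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="none").sum(dim=-1) * (T * T)                  # [batch]
    return ((1.0 - weights) * ce + weights * kd).mean()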

Adversarial data augmentation for task-specific knowledge distillation of pre-trained transformers

M Zhang, NU Naresh, Y He - Proceedings of the AAAI Conference on …, 2022 - ojs.aaai.org
Deep and large pre-trained language models (e.g., BERT, GPT-3) are state-of-the-art for
various natural language processing tasks. However, the huge size of these models brings …
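
A common way to realize adversarial data augmentation for distillation is a one-step gradient perturbation of the input embeddings; the FGSM-style sketch below assumes a Hugging Face-style model that accepts inputs_embeds and is not the exact procedure of the cited paper.

import torch

def adversarial_embeddings(model, embeds, labels, loss_fn, epsilon=1e-2):
    # One-step (FGSM-style) perturbation of the input embeddings, used as
    # extra training examples during distillation. Illustrative only.
    embeds = embeds.detach().requires_grad_(True)
    loss = loss_fn(model(inputs_embeds=embeds).logits, labels)
    grad, = torch.autograd.grad(loss, embeds)
    return (embeds + epsilon * grad.sign()).detach()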

How to select one among all? An extensive empirical study towards the robustness of knowledge distillation in natural language understanding

T Li, A Rashid, A Jafari, P Sharma, A Ghodsi… - arXiv preprint arXiv …, 2021 - arxiv.org
Knowledge Distillation (KD) is a model compression algorithm that helps transfer the
knowledge of a large neural network into a smaller one. Even though KD has shown …

Relay knowledge distillation for efficiently boosting the performance of shallow networks

S Fu, Z Lai, Y Zhang, Y Liu, X Yang - Neurocomputing, 2022 - Elsevier
To reduce the computational cost and memory footprint of powerful deep neural
networks for applications on edge devices, many model compression methods have been …