Intermediate layer matching has been shown to be an effective approach for improving knowledge distillation (KD). However, this technique applies matching in the hidden spaces of two …
L Ding, Y Yang - 2020 IEEE International Conference on …, 2020 - ieeexplore.ieee.org
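The entry above concerns intermediate-layer matching, i.e. aligning student and teacher hidden representations during distillation. Below is a minimal PyTorch sketch of such a matching loss, assuming Hugging Face-style hidden-state tuples (index 0 = embedding output); the linear projection and the uniform layer-to-layer mapping are illustrative assumptions, not the specific method of the cited paper.

```python
import torch.nn as nn

class HiddenStateMatcher(nn.Module):
    """MSE between teacher hidden states and linearly projected student hidden states."""

    def __init__(self, student_dim, teacher_dim, student_layers, teacher_layers):
        super().__init__()
        # Project the (usually smaller) student hidden size into the teacher's space.
        self.proj = nn.Linear(student_dim, teacher_dim)
        # Uniform strided mapping: student layer i is matched to teacher layer stride*(i+1)
        # (assumes teacher_layers is a multiple of student_layers).
        stride = teacher_layers // student_layers
        self.layer_map = [stride * (i + 1) for i in range(student_layers)]
        self.mse = nn.MSELoss()

    def forward(self, student_hidden, teacher_hidden):
        # Both arguments: tuples of [batch, seq_len, dim] tensors as returned with
        # output_hidden_states=True, where index 0 holds the embedding output.
        loss = 0.0
        for s_idx, t_idx in enumerate(self.layer_map, start=1):
            loss = loss + self.mse(self.proj(student_hidden[s_idx]),
                                   teacher_hidden[t_idx].detach())
        return loss / len(self.layer_map)
```

In practice a term like this is added to the soft-target distillation loss with a tunable weight rather than used on its own.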
The success of pre-trained models such as BERT in Natural Language Processing (NLP) has created a strong demand for model compression. Previous works adopting knowledge distillation …
F Yuan, L Shou, J Pei, W Lin, M Gong, Y Fu… - Proceedings of the AAAI …, 2021 - ojs.aaai.org
In natural language processing (NLP) tasks, slow inference speed and a large GPU memory footprint remain the main bottlenecks for deploying pre-trained deep models in production. As a …
J Ko, S Park, M Jeong, S Hong, E Ahn… - arXiv preprint arXiv …, 2023 - arxiv.org
Knowledge distillation (KD) is a highly promising method for mitigating the computational cost of pre-trained language models (PLMs). Among various KD approaches …
We benchmark knowledge distillation (KD) from task-specific BERT-base teacher models to various student models: BiLSTM, CNN, BERT-Tiny, BERT-Mini, and BERT-Small …
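For context, a single task-specific distillation step of the kind benchmarked above can be sketched as follows. The model and batch layout (Hugging Face-style `.logits` outputs, an `inputs`/`labels` batch dict) and the temperature and mixing values are assumptions for illustration, not details of the cited benchmark.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, batch, optimizer, temperature=2.0, alpha=0.5):
    """One optimization step blending soft-target KL loss with hard-label cross entropy."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(**batch["inputs"]).logits  # frozen, fine-tuned teacher
    student_logits = student(**batch["inputs"]).logits

    # Soft targets: KL divergence between temperature-scaled distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hard targets: standard cross entropy against the gold task labels.
    ce_loss = F.cross_entropy(student_logits, batch["labels"])

    loss = alpha * kd_loss + (1.0 - alpha) * ce_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only output logits are matched here, the same step applies to non-Transformer students such as the BiLSTM and CNN baselines, provided they produce logits over the same label set as the teacher.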
Knowledge Distillation (KD) is extensively used in Natural Language Processing to compress the pre-training and task-specific fine-tuning phases of large neural language …
M Zhang, NU Naresh, Y He - Proceedings of the AAAI Conference on …, 2022 - ojs.aaai.org
Deep and large pre-trained language models (e.g., BERT, GPT-3) are state-of-the-art for various natural language processing tasks. However, the huge size of these models brings …
Knowledge Distillation (KD) is a model compression technique that transfers the knowledge of a large neural network into a smaller one. Even though KD has shown …
S Fu, Z Lai, Y Zhang, Y Liu, X Yang - Neurocomputing, 2022 - Elsevier
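Written out, the classic soft-target objective that KD works of this kind typically start from (and which the training-step sketch earlier implements) is commonly stated, for teacher and student logits $z_t$ and $z_s$, temperature $T$, gold label $y$, and mixing weight $\alpha$, as

$$\mathcal{L}_{\mathrm{KD}} = \alpha\, T^{2}\, \mathrm{KL}\big(\mathrm{softmax}(z_t/T)\,\big\|\,\mathrm{softmax}(z_s/T)\big) + (1-\alpha)\, \mathrm{CE}\big(y,\ \mathrm{softmax}(z_s)\big).$$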
To reduce the computational cost and memory footprint of powerful deep neural networks for applications on edge devices, many model compression methods have been …