Self-knowledge distillation in natural language processing

S Hahn, H Choi - arXiv preprint arXiv:1908.01851, 2019 - arxiv.org
Since deep learning became a key player in natural language processing (NLP), many deep
learning models have shown remarkable performance on a variety of NLP tasks …
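
For reference, the soft-target objective that the distillation methods in this list build on can be written as a temperature-scaled KL term between teacher and student outputs. The PyTorch sketch below is a generic illustration with placeholder names, not the implementation of any particular paper here.

    import torch
    import torch.nn.functional as F

    def soft_target_kd_loss(student_logits: torch.Tensor,
                            teacher_logits: torch.Tensor,
                            temperature: float = 2.0) -> torch.Tensor:
        # Hinton-style distillation loss: KL between temperature-softened teacher
        # and student distributions, scaled by T^2 to keep gradient magnitudes
        # comparable across temperatures.
        log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
        p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

    # Example usage with random logits (batch of 4, 3-way classification).
    student_logits = torch.randn(4, 3)
    teacher_logits = torch.randn(4, 3)
    loss = soft_target_kd_loss(student_logits, teacher_logits, temperature=4.0)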

LRC-BERT: latent-representation contrastive knowledge distillation for natural language understanding

H Fu, S Zhou, Q Yang, J Tang, G Liu, K Liu… - Proceedings of the AAAI …, 2021 - ojs.aaai.org
Pre-trained models such as BERT have achieved great results on various natural
language processing problems. However, a large number of parameters need significant …

TextBrewer: An open-source knowledge distillation toolkit for natural language processing

Z Yang, Y Cui, Z Chen, W Che, T Liu, S Wang… - arXiv preprint arXiv …, 2020 - arxiv.org
In this paper, we introduce TextBrewer, an open-source knowledge distillation toolkit
designed for natural language processing. It works with different neural network models and …

Improving multi-task deep neural networks via knowledge distillation for natural language understanding

X Liu, P He, W Chen, J Gao - arXiv preprint arXiv:1904.09482, 2019 - arxiv.org
This paper explores the use of knowledge distillation to improve a Multi-Task Deep Neural
Network (MT-DNN) (Liu et al., 2019) for learning text representations across multiple natural …
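
This line of work distills an ensemble of task-specific teachers into a single student by averaging their output distributions and mixing soft and hard targets. The sketch below illustrates that averaging step with generic placeholder names; it is an assumption-laden illustration, not the authors' code.

    import torch
    import torch.nn.functional as F

    def ensemble_soft_targets(teacher_logits_list, temperature: float = 1.0):
        # Average the softened probability distributions of several teachers to
        # produce the soft targets used to train a single student.
        probs = [F.softmax(logits / temperature, dim=-1) for logits in teacher_logits_list]
        return torch.stack(probs, dim=0).mean(dim=0)

    def student_loss(student_logits, soft_targets, hard_labels, alpha=0.5):
        # Mix cross-entropy on the hard labels with cross-entropy on the ensemble
        # soft targets (a common formulation for multi-teacher distillation).
        ce_hard = F.cross_entropy(student_logits, hard_labels)
        ce_soft = -(soft_targets * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()
        return alpha * ce_hard + (1.0 - alpha) * ce_soft

    # Example: three teachers, batch of 2, 4 classes.
    teachers = [torch.randn(2, 4) for _ in range(3)]
    targets = ensemble_soft_targets(teachers, temperature=2.0)
    loss = student_loss(torch.randn(2, 4), targets, torch.tensor([1, 3]))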

A survey of knowledge enhanced pre-trained models

J Yang, X Hu, G Xiao, Y Shen - arXiv preprint arXiv:2110.00269, 2021 - arxiv.org
Pre-trained language models learn informative word representations on a large-scale text
corpus through self-supervised learning, which has achieved promising performance in …

Reinforced multi-teacher selection for knowledge distillation

F Yuan, L Shou, J Pei, W Lin, M Gong, Y Fu… - Proceedings of the AAAI …, 2021 - ojs.aaai.org
In natural language processing (NLP) tasks, slow inference speed and a large GPU memory
footprint remain the bottleneck for applying pre-trained deep models in production. As a …
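
The paper frames per-sample teacher selection as a reinforcement-learning problem. The sketch below is only a simplified stand-in that picks, for each example, the teacher most confident in the gold label, to make the multi-teacher setting concrete; it does not implement the paper's RL policy.

    import torch
    import torch.nn.functional as F

    def select_teacher_per_example(teacher_logits_list, labels):
        # Pick, for each example, the teacher assigning the highest probability to
        # the gold label, and return that teacher's soft targets. Simplified
        # heuristic only; the paper learns this selection with RL.
        probs = torch.stack([F.softmax(logits, dim=-1) for logits in teacher_logits_list], dim=0)
        gold = labels.view(1, -1, 1).expand(probs.size(0), -1, 1)   # (teachers, batch, 1)
        gold_probs = probs.gather(-1, gold).squeeze(-1)             # (teachers, batch)
        best = gold_probs.argmax(dim=0)                             # best teacher per example
        return probs[best, torch.arange(labels.size(0))]            # (batch, classes)

    # Example: two teachers, batch of 3, 5 classes.
    soft_targets = select_teacher_per_example(
        [torch.randn(3, 5), torch.randn(3, 5)], torch.tensor([0, 2, 4]))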

Greedy-layer pruning: Speeding up transformer models for natural language processing

D Peer, S Stabinger, S Engl… - Pattern Recognition …, 2022 - Elsevier
Fine-tuning transformer models after unsupervised pre-training achieves very high
performance on many different natural language processing tasks. Unfortunately …
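
Greedy layer pruning removes one transformer layer at a time, keeping the removal that hurts a validation metric the least. A minimal Python sketch of a single greedy step follows, assuming a user-supplied scoring function; names are illustrative, not the authors' implementation.

    def greedy_prune_one_layer(layers, evaluate):
        # One greedy step: try removing each layer in turn and keep the removal
        # that leaves the highest validation score. `layers` is an ordered list of
        # layer modules (or ids); `evaluate` scores a candidate layer list.
        best_score, best_layers, best_idx = float("-inf"), None, None
        for i in range(len(layers)):
            candidate = layers[:i] + layers[i + 1:]
            score = evaluate(candidate)
            if score > best_score:
                best_score, best_layers, best_idx = score, candidate, i
        return best_layers, best_idx, best_score

    # Toy usage: "layers" are just ids and the metric happens to prefer dropping
    # the highest-id layer (a stand-in for a real validation evaluation).
    layers = list(range(12))
    pruned, dropped, score = greedy_prune_one_layer(layers, evaluate=lambda ls: -sum(ls))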

Pretrained encyclopedia: Weakly supervised knowledge-pretrained language model

W Xiong, J Du, WY Wang, V Stoyanov - arXiv preprint arXiv:1912.09637, 2019 - arxiv.org
Recent breakthroughs in pretrained language models have shown the effectiveness of
self-supervised learning for a wide range of natural language processing (NLP) tasks. In addition …

Knowledge distillation across ensembles of multilingual models for low-resource languages

J Cui, B Kingsbury, B Ramabhadran… - … , Speech and Signal …, 2017 - ieeexplore.ieee.org
This paper investigates the effectiveness of knowledge distillation in the context of
multilingual models. We show that with knowledge distillation, Long Short-Term Memory …

Rethinking Kullback-Leibler divergence in knowledge distillation for large language models

T Wu, C Tao, J Wang, R Yang, Z Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Kullback-Leibler divergence has been widely used in Knowledge Distillation (KD) to
compress Large Language Models (LLMs). Contrary to prior assertions that reverse …
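
The question raised here is which direction of the KL divergence to minimize during distillation. The generic PyTorch sketch below computes both directions from teacher and student logits (illustrative code, not the paper's); the comments state the conventional mean-seeking/mode-seeking intuitions that the paper revisits.

    import torch
    import torch.nn.functional as F

    def forward_kl(student_logits, teacher_logits):
        # Forward KL, KL(teacher || student): conventionally described as
        # mean-seeking, pushing the student to cover all teacher modes.
        log_p_student = F.log_softmax(student_logits, dim=-1)
        p_teacher = F.softmax(teacher_logits, dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

    def reverse_kl(student_logits, teacher_logits):
        # Reverse KL, KL(student || teacher): conventionally described as
        # mode-seeking, concentrating the student on high-probability regions.
        log_p_teacher = F.log_softmax(teacher_logits, dim=-1)
        p_student = F.softmax(student_logits, dim=-1)
        return F.kl_div(log_p_teacher, p_student, reduction="batchmean")

    # Example with token-level logits: batch of 2, vocabulary of 8.
    s, t = torch.randn(2, 8), torch.randn(2, 8)
    fkl, rkl = forward_kl(s, t), reverse_kl(s, t)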