Natural language generation for effective knowledge distillation

R Tang, Y Lu, J Lin - Proceedings of the 2nd Workshop on Deep …, 2019 - aclanthology.org
Knowledge distillation can effectively transfer knowledge from BERT, a deep
language representation model, to traditional, shallow word embedding-based neural …
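
As background for this entry and several below, here is a minimal sketch of the standard distillation objective: a softened-logit KL term plus hard-label cross-entropy. The temperature and mixing weight are illustrative defaults, not values from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard KD objective: soft-target KL at temperature T plus hard-label CE.
    T and alpha are illustrative hyperparameters, not values from the paper."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                       # rescale so the soft term matches the hard-loss magnitude
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# toy usage with random logits
student_logits = torch.randn(8, 3)
teacher_logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```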

Adversarial data augmentation for task-specific knowledge distillation of pre-trained transformers

M Zhang, NU Naresh, Y He - Proceedings of the AAAI Conference on …, 2022 - ojs.aaai.org
Deep and large pre-trained language models (e.g., BERT, GPT-3) are state-of-the-art for
various natural language processing tasks. However, the huge size of these models brings …
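
The title points to adversarial data augmentation for distillation; the sketch below shows one generic way to realize that idea, an FGSM-style perturbation of input embeddings added to the distillation batch. It is an assumption-laden illustration, not the authors' exact augmentation procedure.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=2.0):
    # soft-target KL between student and teacher at temperature T
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

def adversarial_kd_step(student, teacher, embeds, epsilon=1e-2, T=2.0):
    # Distill on clean embeddings plus an FGSM-style perturbation of them.
    # A generic illustration; the paper's augmentation strategy may differ.
    embeds = embeds.detach().requires_grad_(True)
    with torch.no_grad():
        teacher_clean = teacher(embeds)
    loss_clean = kd_loss(student(embeds), teacher_clean, T)

    # perturb the inputs in the direction that most increases the KD loss
    grad, = torch.autograd.grad(loss_clean, embeds, retain_graph=True)
    adv_embeds = (embeds + epsilon * grad.sign()).detach()

    with torch.no_grad():
        teacher_adv = teacher(adv_embeds)
    loss_adv = kd_loss(student(adv_embeds), teacher_adv, T)
    return loss_clean + loss_adv

# toy usage: linear "teacher" and "student" over 16-dimensional embeddings
teacher = torch.nn.Linear(16, 3)
student = torch.nn.Linear(16, 3)
embeds = torch.randn(8, 16)
loss = adversarial_kd_step(student, teacher, embeds)
loss.backward()
print(loss.item())
```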

LRC-BERT: latent-representation contrastive knowledge distillation for natural language understanding

H Fu, S Zhou, Q Yang, J Tang, G Liu, K Liu… - Proceedings of the AAAI …, 2021 - ojs.aaai.org
Pre-trained models such as BERT have achieved great results on various natural
language processing problems. However, their large number of parameters requires significant …
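
The title names a contrastive objective on latent representations. Below is a generic InfoNCE-style sketch of contrastive distillation between student and teacher hidden states; the projection layer and temperature are assumptions, and the loss is not LRC-BERT's exact formulation.

```python
import torch
import torch.nn.functional as F

def latent_contrastive_loss(student_hidden, teacher_hidden, temperature=0.1):
    """Each student vector should be closest to the teacher vector of the same
    example and far from the teacher vectors of other examples in the batch.
    A generic InfoNCE-style sketch, not the exact LRC-BERT objective."""
    s = F.normalize(student_hidden, dim=-1)          # (B, d)
    t = F.normalize(teacher_hidden, dim=-1)          # (B, d)
    logits = s @ t.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)          # positives sit on the diagonal

# toy usage: project a 128-dim student state into the teacher's 256-dim space
proj = torch.nn.Linear(128, 256)
student_hidden = proj(torch.randn(8, 128))
teacher_hidden = torch.randn(8, 256)
print(latent_contrastive_loss(student_hidden, teacher_hidden))
```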

TextBrewer: An open-source knowledge distillation toolkit for natural language processing

Z Yang, Y Cui, Z Chen, W Che, T Liu, S Wang… - arXiv preprint arXiv …, 2020 - arxiv.org
In this paper, we introduce TextBrewer, an open-source knowledge distillation toolkit
designed for natural language processing. It works with different neural network models and …
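
For orientation, this is the kind of plain teacher-student training loop that such a toolkit wraps behind its configuration objects; it is written in vanilla PyTorch and is not the actual TextBrewer API.

```python
import torch
import torch.nn.functional as F

def distill_epoch(teacher, student, dataloader, optimizer, T=4.0, alpha=0.7):
    """One epoch of a plain teacher-student loop; T and alpha are illustrative."""
    teacher.eval()
    student.train()
    for inputs, labels in dataloader:
        with torch.no_grad():
            teacher_logits = teacher(inputs)
        student_logits = student(inputs)
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        hard = F.cross_entropy(student_logits, labels)
        loss = alpha * soft + (1.0 - alpha) * hard
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# toy usage with linear models and random data
teacher = torch.nn.Linear(32, 4)
student = torch.nn.Linear(32, 4)
data = [(torch.randn(16, 32), torch.randint(0, 4, (16,))) for _ in range(5)]
distill_epoch(teacher, student, data, torch.optim.SGD(student.parameters(), lr=0.1))
```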

XtremeDistilTransformers: Task transfer for task-agnostic distillation

S Mukherjee, AH Awadallah, J Gao - arXiv preprint arXiv:2106.04563, 2021 - arxiv.org
While deep and large pre-trained models are the state-of-the-art for various natural
language processing tasks, their huge size poses significant challenges for practical uses in …

Improved Knowledge Distillation for Pre-trained Language Models via Knowledge Selection

C Wang, Y Lu, Y Mu, Y Hu, T Xiao, J Zhu - arXiv preprint arXiv:2302.00444, 2023 - arxiv.org
Knowledge distillation addresses the problem of transferring knowledge from a teacher
model to a student model. In this process, we typically have multiple types of knowledge …
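
A hedged sketch of what combining several kinds of knowledge looks like: logits, hidden states, and attention maps distilled under per-source weights. The fixed weights stand in for the paper's learned selection mechanism.

```python
import torch
import torch.nn.functional as F

def multi_knowledge_kd_loss(student_out, teacher_out, weights, T=2.0):
    """Combine several distillation knowledge sources under per-source weights.
    The paper selects which knowledge to use; this sketch takes fixed weights."""
    losses = {
        "logits": F.kl_div(
            F.log_softmax(student_out["logits"] / T, dim=-1),
            F.softmax(teacher_out["logits"] / T, dim=-1),
            reduction="batchmean",
        ) * (T * T),
        "hidden": F.mse_loss(student_out["hidden"], teacher_out["hidden"]),
        "attention": F.mse_loss(student_out["attention"], teacher_out["attention"]),
    }
    return sum(weights[k] * losses[k] for k in losses)

# toy usage with matching shapes (a real setup would first project the student's
# states to the teacher's dimensionality)
student_out = {"logits": torch.randn(8, 3), "hidden": torch.randn(8, 64),
               "attention": torch.randn(8, 12, 16, 16)}
teacher_out = {k: torch.randn_like(v) for k, v in student_out.items()}
weights = {"logits": 1.0, "hidden": 0.5, "attention": 0.5}
print(multi_knowledge_kd_loss(student_out, teacher_out, weights))
```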

Prompting to distill: Boosting data-free knowledge distillation via reinforced prompt

X Ma, X Wang, G Fang, Y Shen, W Lu - arXiv preprint arXiv:2205.07523, 2022 - arxiv.org
Data-free knowledge distillation (DFKD) conducts knowledge distillation by eliminating the
dependence on original training data, and has recently achieved impressive results in …
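
To make the data-free setting concrete, here is a generic DFKD student update in which a noise-conditioned generator supplies pseudo-inputs; the paper instead steers generation with a reinforced prompt, which this sketch does not reproduce.

```python
import torch
import torch.nn.functional as F

def data_free_kd_step(generator, teacher, student, opt_s, batch_size=16, T=2.0):
    """One student update in a generic data-free KD loop: the generator maps
    noise to pseudo-inputs and the student matches the teacher on them."""
    z = torch.randn(batch_size, generator.in_features)
    pseudo_inputs = generator(z)                      # synthetic inputs; no real data used
    with torch.no_grad():
        teacher_logits = teacher(pseudo_inputs)
    student_logits = student(pseudo_inputs.detach())
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    opt_s.zero_grad()
    loss.backward()
    opt_s.step()
    return loss.item()

# toy usage: linear generator/teacher/student over a 32-dim input space
generator = torch.nn.Linear(8, 32)
teacher = torch.nn.Linear(32, 4)
student = torch.nn.Linear(32, 4)
opt_s = torch.optim.Adam(student.parameters(), lr=1e-3)
print(data_free_kd_step(generator, teacher, student, opt_s))
```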

Self-knowledge distillation in natural language processing

S Hahn, H Choi - arXiv preprint arXiv:1908.01851, 2019 - arxiv.org
Since deep learning became a key player in natural language processing (NLP), many deep
learning models have shown remarkable performance in a variety of NLP tasks …
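
Self-knowledge distillation, in the generic sense sketched below, regularizes the model toward softened predictions from an earlier snapshot of itself, with no separate teacher; this illustrates the idea, not the specific formulation of this paper.

```python
import copy
import torch
import torch.nn.functional as F

def self_distillation_loss(model, prev_model, inputs, labels, T=2.0, alpha=0.3):
    """Distill the model from an earlier frozen copy of itself plus hard labels.
    A generic self-KD sketch; T and alpha are illustrative."""
    logits = model(inputs)
    with torch.no_grad():
        prev_logits = prev_model(inputs)
    soft = F.kl_div(
        F.log_softmax(logits / T, dim=-1),
        F.softmax(prev_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# toy usage: the "teacher" is just a frozen snapshot of the model itself
model = torch.nn.Linear(32, 4)
prev_model = copy.deepcopy(model).eval()
inputs, labels = torch.randn(8, 32), torch.randint(0, 4, (8,))
print(self_distillation_loss(model, prev_model, inputs, labels))
```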

RW-KD: Sample-wise loss terms re-weighting for knowledge distillation

P Lu, A Ghaddar, A Rashid… - Findings of the …, 2021 - aclanthology.org
Knowledge Distillation (KD) is extensively used in Natural Language Processing to
compress the pre-training and task-specific fine-tuning phases of large neural language …
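
A sketch of sample-wise re-weighting: each example gets its own mixing weight between the soft (KD) and hard (CE) loss terms. RW-KD learns these weights adaptively; here they are simply supplied as an argument.

```python
import torch
import torch.nn.functional as F

def reweighted_kd_loss(student_logits, teacher_logits, labels, sample_weights, T=2.0):
    """Per-sample mixing of the soft and hard losses: example i uses weight w_i.
    RW-KD learns the weights; this sketch just takes them as given."""
    log_p = F.log_softmax(student_logits / T, dim=-1)
    q = F.softmax(teacher_logits / T, dim=-1)
    soft_per_sample = F.kl_div(log_p, q, reduction="none").sum(dim=-1) * (T * T)  # (B,)
    hard_per_sample = F.cross_entropy(student_logits, labels, reduction="none")   # (B,)
    w = sample_weights                                                             # (B,) in [0, 1]
    return (w * soft_per_sample + (1.0 - w) * hard_per_sample).mean()

# toy usage with uniform weights (equivalent to a fixed 50/50 mix)
student_logits, teacher_logits = torch.randn(8, 3), torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
print(reweighted_kd_loss(student_logits, teacher_logits, labels,
                         sample_weights=torch.full((8,), 0.5)))
```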

Rethinking Kullback-Leibler divergence in knowledge distillation for large language models

T Wu, C Tao, J Wang, R Yang, Z Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Kullback-Leibler divergence has been widely used in Knowledge Distillation (KD) to
compress Large Language Models (LLMs). Contrary to prior assertions that reverse …
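
The two divergence directions at issue can be stated compactly: forward KL, KL(teacher || student), is the conventional KD choice, while reverse KL, KL(student || teacher), is the alternative this line of work examines. A minimal sketch of both, with illustrative temperature handling:

```python
import torch
import torch.nn.functional as F

def forward_kl(student_logits, teacher_logits, T=1.0):
    """KL(teacher || student): the conventional KD direction."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

def reverse_kl(student_logits, teacher_logits, T=1.0):
    """KL(student || teacher): the alternative direction discussed in recent LLM distillation work."""
    return F.kl_div(
        F.log_softmax(teacher_logits / T, dim=-1),
        F.softmax(student_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

# on a toy batch the two directions generally give different values
s, t = torch.randn(4, 10), torch.randn(4, 10)
print(forward_kl(s, t).item(), reverse_kl(s, t).item())
```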