Rethinking Kullback-Leibler divergence in knowledge distillation for large language models

T Wu, C Tao, J Wang, R Yang, Z Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Kullback-Leibler divergence has been widely used in Knowledge Distillation (KD) to
compress Large Language Models (LLMs). Contrary to prior assertions that reverse …
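
For context, the two KL directions contrasted in this line of work are the forward direction KL(p_teacher || p_student) and the reverse direction KL(p_student || p_teacher), computed over the token distributions of the two models. The PyTorch sketch below is a generic illustration of these two losses under assumed tensor shapes; it is not the objective proposed in the paper above.

# Minimal sketch (not the paper's method): forward vs. reverse KL between
# teacher and student token distributions, as commonly used in LLM distillation.
import torch
import torch.nn.functional as F

def forward_kl(student_logits, teacher_logits, temperature=1.0):
    # KL(teacher || student): mode-covering; the student spreads mass over all teacher modes.
    t = F.softmax(teacher_logits / temperature, dim=-1)
    log_t = F.log_softmax(teacher_logits / temperature, dim=-1)
    log_s = F.log_softmax(student_logits / temperature, dim=-1)
    return (t * (log_t - log_s)).sum(-1).mean()

def reverse_kl(student_logits, teacher_logits, temperature=1.0):
    # KL(student || teacher): mode-seeking; the student concentrates on a subset of teacher modes.
    s = F.softmax(student_logits / temperature, dim=-1)
    log_s = F.log_softmax(student_logits / temperature, dim=-1)
    log_t = F.log_softmax(teacher_logits / temperature, dim=-1)
    return (s * (log_s - log_t)).sum(-1).mean()

# Illustration with made-up shapes: batch of 2 sequences, 4 tokens, vocabulary of 8.
student = torch.randn(2, 4, 8)
teacher = torch.randn(2, 4, 8)
print(forward_kl(student, teacher).item(), reverse_kl(student, teacher).item())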

A survey on knowledge distillation of large language models

X Xu, M Li, C Tao, T Shen, R Cheng, J Li, C Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
This survey presents an in-depth exploration of knowledge distillation (KD) techniques
within the realm of Large Language Models (LLMs), spotlighting the pivotal role of KD in …

XtremeDistilTransformers: Task transfer for task-agnostic distillation

S Mukherjee, AH Awadallah, J Gao - arXiv preprint arXiv:2106.04563, 2021 - arxiv.org
While deep and large pre-trained models are the state-of-the-art for various natural
language processing tasks, their huge size poses significant challenges for practical uses in …

Survey on knowledge distillation for large language models: methods, evaluation, and application

C Yang, Y Zhu, W Lu, Y Wang, Q Chen, C Gao… - ACM Transactions on …, 2024 - dl.acm.org
Large Language Models (LLMs) have showcased exceptional capabilities in various
domains, attracting significant interest from both academia and industry. Despite their …

DDK: Distilling domain knowledge for efficient large language models

J Liu, C Zhang, J Guo, Y Zhang, H Que, K Deng… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the advanced intelligence abilities of large language models (LLMs) in various
applications, they still face significant computational and storage demands. Knowledge …

Cost-effective distillation of large language models

S Dasgupta, T Cohn, T Baldwin - Findings of the Association for …, 2023 - aclanthology.org
Knowledge distillation (KD) involves training a small “student” model to replicate the
strong performance of a high-capacity “teacher” model, enabling efficient deployment in …

Universal-KD: Attention-based output-grounded intermediate layer knowledge distillation

Y Wu, M Rezagholizadeh, A Ghaddar… - Proceedings of the …, 2021 - aclanthology.org
Intermediate layer matching has been shown to be an effective approach for improving
knowledge distillation (KD). However, this technique applies matching in the hidden spaces of two …
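
As background for this entry, a common form of intermediate-layer matching penalizes the distance between a projected student hidden state and the corresponding teacher hidden state. The sketch below shows that generic loss with assumed layer sizes; it is not Universal-KD's attention-based, output-grounded formulation.

# Generic intermediate-layer matching sketch (assumed sizes, not Universal-KD's method):
# project student hidden states into the teacher's hidden size and penalize the MSE
# between matched layers.
import torch
import torch.nn as nn

class LayerMatchLoss(nn.Module):
    def __init__(self, student_dim, teacher_dim):
        super().__init__()
        # A learned projection bridges the dimensionality gap between the two models.
        self.proj = nn.Linear(student_dim, teacher_dim)
        self.mse = nn.MSELoss()

    def forward(self, student_hidden, teacher_hidden):
        # student_hidden: (batch, seq, student_dim); teacher_hidden: (batch, seq, teacher_dim)
        return self.mse(self.proj(student_hidden), teacher_hidden.detach())

# Illustration: a 384-dim student layer matched to a 768-dim teacher layer.
loss_fn = LayerMatchLoss(student_dim=384, teacher_dim=768)
loss = loss_fn(torch.randn(2, 16, 384), torch.randn(2, 16, 768))
print(loss.item())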

Natural language generation for effective knowledge distillation

R Tang, Y Lu, J Lin - Proceedings of the 2nd Workshop on Deep …, 2019 - aclanthology.org
Knowledge distillation can effectively transfer knowledge from BERT, a deep
language representation model, to traditional, shallow word embedding-based neural …

Knowledge distillation of large language models

Y Gu, L Dong, F Wei, M Huang - arXiv preprint arXiv:2306.08543, 2023 - arxiv.org
Knowledge Distillation (KD) is a promising technique for reducing the high computational
demand of large language models (LLMs). However, previous KD methods are primarily …

LRC-BERT: Latent-representation contrastive knowledge distillation for natural language understanding

H Fu, S Zhou, Q Yang, J Tang, G Liu, K Liu… - Proceedings of the AAAI …, 2021 - ojs.aaai.org
Pre-trained models such as BERT have achieved great results in various natural
language processing problems. However, their large number of parameters requires significant …
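
For illustration, contrastive representation distillation is often instantiated with an InfoNCE-style loss in which the student and teacher embeddings of the same input form a positive pair and the other items in the batch serve as negatives. The sketch below is that generic formulation with assumed dimensions; it is not LRC-BERT's specific latent-representation objective.

# Generic contrastive distillation sketch (assumed shapes, not LRC-BERT's loss):
# align pooled student and teacher representations of the same input while pushing
# apart representations of other inputs in the batch.
import torch
import torch.nn.functional as F

def contrastive_distill_loss(student_emb, teacher_emb, temperature=0.1):
    # student_emb, teacher_emb: (batch, dim) pooled sentence representations.
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    logits = s @ t.T / temperature        # (batch, batch) cosine-similarity matrix
    targets = torch.arange(s.size(0))     # the i-th student matches the i-th teacher
    return F.cross_entropy(logits, targets)

# Illustration with a batch of 4 and 256-dim embeddings.
loss = contrastive_distill_loss(torch.randn(4, 256), torch.randn(4, 256))
print(loss.item())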