Cost-effective distillation of large language models

S Dasgupta, T Cohn, T Baldwin - Findings of the Association for …, 2023 - aclanthology.org
Knowledge distillation (KD) involves training a small “student” model to replicate the
strong performance of a high-capacity “teacher” model, enabling efficient deployment in …
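All of the entries below build on the same basic recipe, so a minimal sketch of vanilla response-based distillation (a temperature-scaled soft-target term plus the usual hard-label cross-entropy) may be a useful point of reference. The temperature and weighting values are illustrative defaults, not taken from any cited paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Vanilla response-based KD: soft-target KL term + hard-label CE term.

    student_logits, teacher_logits: (batch, num_classes)
    labels: (batch,) gold class indices
    temperature, alpha: illustrative hyperparameters, not from any cited paper.
    """
    # Soften both distributions with the same temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL(teacher || student); the T^2 factor keeps gradients comparable across temperatures.
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Standard supervised loss on the gold labels.
    ce_term = F.cross_entropy(student_logits, labels)

    return alpha * kd_term + (1.0 - alpha) * ce_term
```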

ReAugKD: Retrieval-augmented knowledge distillation for pre-trained language models

J Zhang, A Muhamed, A Anantharaman… - Proceedings of the …, 2023 - aclanthology.org
Knowledge Distillation (KD) is one of the most effective approaches to deploying
large-scale pre-trained language models in low-latency environments by transferring the …
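The snippet is cut off before the method itself; going only by the title, a retrieval-augmented variant of distillation could look roughly like the sketch below, where teacher embeddings and soft labels are stored in a nearest-neighbour index and aggregated for a new student embedding. The class and its interface are hypothetical, not the authors' implementation.

```python
import numpy as np

class SoftLabelKnowledgeBase:
    """Hypothetical retrieval component for retrieval-augmented distillation:
    stores teacher embeddings with their soft labels and returns a
    neighbour-weighted soft label for a new student embedding."""

    def __init__(self, teacher_embeddings, teacher_soft_labels):
        # (N, d) teacher embeddings and (N, C) teacher output distributions.
        self.keys = teacher_embeddings / np.linalg.norm(
            teacher_embeddings, axis=1, keepdims=True)
        self.values = teacher_soft_labels

    def retrieve(self, query_embedding, k=8):
        q = query_embedding / np.linalg.norm(query_embedding)
        sims = self.keys @ q                 # cosine similarity to all stored keys
        top = np.argsort(-sims)[:k]          # k nearest teacher examples
        weights = np.exp(sims[top])          # similarity-weighted average
        weights /= weights.sum()
        return weights @ self.values[top]    # aggregated soft label, shape (C,)
```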

DDK: Distilling domain knowledge for efficient large language models

J Liu, C Zhang, J Guo, Y Zhang, H Que, K Deng… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the advanced intelligence abilities of large language models (LLMs) in various
applications, they still face significant computational and storage demands. Knowledge …
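The abstract is truncated before the method; reading the title as domain-aware distillation, one plausible ingredient is re-weighting how often each domain's data is sampled based on how far the student lags the teacher there. The function below is an assumed illustration of that idea, not the DDK algorithm.

```python
import numpy as np

def domain_sampling_weights(student_loss_by_domain, teacher_loss_by_domain,
                            temperature=1.0):
    """Illustrative domain re-weighting: domains where the student trails the
    teacher most get sampled more often for distillation. Not the DDK recipe."""
    gaps = np.maximum(np.array(student_loss_by_domain)
                      - np.array(teacher_loss_by_domain), 0.0)
    logits = gaps / temperature
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()

# Example: three domains where the student lags the teacher by different margins.
print(domain_sampling_weights([2.1, 1.4, 3.0], [1.8, 1.3, 2.0]))
```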

MixKD: Towards efficient distillation of large-scale language models

KJ Liang, W Hao, D Shen, Y Zhou, W Chen… - arXiv preprint arXiv …, 2020 - arxiv.org
Large-scale language models have recently demonstrated impressive empirical
performance. Nevertheless, the improved results are attained at the price of bigger models …
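The snippet stops at the motivation, but the title points to mixup-style augmentation during distillation; a hedged sketch of that idea (interpolating pairs of input embeddings and matching the teacher on the mixed inputs) follows. The function names and embedding-level interpolation are assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def mixup_distillation_step(student, teacher, input_embeds, alpha=0.4):
    """Illustrative mixup-for-KD step: interpolate pairs of input embeddings
    and train the student to match the teacher on the mixed inputs.
    `student` and `teacher` are assumed to map embeddings -> logits."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(input_embeds.size(0))
    mixed = lam * input_embeds + (1.0 - lam) * input_embeds[perm]

    with torch.no_grad():
        teacher_probs = F.softmax(teacher(mixed), dim=-1)
    student_logp = F.log_softmax(student(mixed), dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean")
```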

HomoDistil: Homotopic task-agnostic distillation of pre-trained transformers

C Liang, H Jiang, Z Li, X Tang, B Yin, T Zhao - arXiv preprint arXiv …, 2023 - arxiv.org
Knowledge distillation has been shown to be a powerful model compression approach to
facilitate the deployment of pre-trained language models in practice. This paper focuses on …
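Again the snippet is cut short; the title suggests a student that starts as a copy of the teacher and is shrunk gradually while being distilled. The pruning routine below is a placeholder sketch of the shrinking step under that assumption; the criterion and schedule are not HomoDistil's.

```python
import torch

def gradually_prune(student, target_sparsity, step, total_steps):
    """Placeholder magnitude pruning that zeroes a growing fraction of the
    smallest weights as training progresses (not HomoDistil's criterion)."""
    current = target_sparsity * min(step / total_steps, 1.0)
    for p in student.parameters():
        if p.dim() < 2:          # skip biases and layer-norm parameters
            continue
        k = int(current * p.numel())
        if k == 0:
            continue
        threshold = p.abs().flatten().kthvalue(k).values
        p.data[p.abs() <= threshold] = 0.0
```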

f-Divergence Minimization for Sequence-Level Knowledge Distillation

Y Wen, Z Li, W Du, L Mou - arXiv preprint arXiv:2307.15190, 2023 - arxiv.org
Knowledge distillation (KD) is the process of transferring knowledge from a large model to a
small one. It has gained increasing attention in the natural language processing community …
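Since the snippet names the objective family, it may help to recall the definition it builds on: for a convex f with f(1) = 0, the f-divergence between a teacher distribution p and a student distribution q_theta is given below, with forward and reverse KL as standard special cases. This is the textbook definition, not a claim about the paper's exact sequence-level estimator.

```latex
D_f\!\left(p \,\Vert\, q_\theta\right)
  = \sum_{y} q_\theta(y)\, f\!\left(\frac{p(y)}{q_\theta(y)}\right),
  \qquad f \text{ convex},\ f(1) = 0,
\qquad
f(t) = t \log t \;\Rightarrow\; \mathrm{KL}\!\left(p \,\Vert\, q_\theta\right),
\qquad
f(t) = -\log t \;\Rightarrow\; \mathrm{KL}\!\left(q_\theta \,\Vert\, p\right).
```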

RAIL-KD: RAndom intermediate layer mapping for knowledge distillation

MA Haidar, N Anchuri, M Rezagholizadeh… - arXiv preprint arXiv …, 2021 - arxiv.org
Intermediate layer knowledge distillation (KD) can improve the standard KD technique
(which only targets the output of teacher and student models) especially over large pre …
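Because the snippet describes intermediate-layer KD directly, a brief sketch may help: a random subset of teacher layers is selected and their hidden states are matched to the student's layers, e.g. with an MSE term. The projection and selection details below are assumptions, not the paper's exact configuration.

```python
import random
import torch
import torch.nn.functional as F

def random_layer_mapping_loss(student_hiddens, teacher_hiddens, projections):
    """Illustrative random intermediate-layer KD: pick as many teacher layers
    as the student has, at random, and match them in order with an MSE term.
    `projections` align hidden sizes; interfaces here are assumptions."""
    n_student = len(student_hiddens)
    chosen = sorted(random.sample(range(len(teacher_hiddens)), n_student))
    loss = 0.0
    for proj, s_h, t_idx in zip(projections, student_hiddens, chosen):
        loss = loss + F.mse_loss(proj(s_h), teacher_hiddens[t_idx])
    return loss / n_student
```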

Adversarial data augmentation for task-specific knowledge distillation of pre-trained transformers

M Zhang, NU Naresh, Y He - Proceedings of the AAAI Conference on …, 2022 - ojs.aaai.org
Deep and large pre-trained language models (e.g., BERT, GPT-3) are state-of-the-art for
various natural language processing tasks. However, the huge size of these models brings …
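The snippet ends before the method; one common reading of "adversarial data augmentation" is generating perturbed inputs that maximise teacher-student disagreement and distilling on them. The FGSM-style sketch below illustrates that general idea under assumed interfaces, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def adversarial_augment(student, teacher, input_embeds, epsilon=1e-2):
    """Illustrative adversarial augmentation for KD: nudge input embeddings in
    the direction that maximises teacher-student disagreement (FGSM-style),
    producing harder examples to distill on. Not the paper's exact method."""
    embeds = input_embeds.clone().detach().requires_grad_(True)
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(embeds), dim=-1)
    student_logp = F.log_softmax(student(embeds), dim=-1)
    disagreement = F.kl_div(student_logp, teacher_probs, reduction="batchmean")
    disagreement.backward()
    return (embeds + epsilon * embeds.grad.sign()).detach()
```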

DistiLLM: Towards streamlined distillation for large language models

J Ko, S Kim, T Chen, SY Yun - arXiv preprint arXiv:2402.03898, 2024 - arxiv.org
Knowledge distillation (KD) is widely used for compressing a teacher model to a smaller
student model, reducing its inference cost and memory footprint while preserving model …
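The snippet stops before the paper's contributions; one refinement that appears in recent LLM distillation objectives is a skewed KL divergence that mixes the teacher and student distributions before comparing them, which keeps the loss bounded when the student assigns near-zero mass. The sketch below illustrates that general idea, not necessarily DistiLLM's exact formulation.

```python
import torch
import torch.nn.functional as F

def skew_kl(teacher_logits, student_logits, skew=0.1):
    """Skewed KL: compare the teacher distribution against a mixture of
    teacher and student distributions. Illustrative, with `skew` as an
    assumed hyperparameter; not a statement of DistiLLM's exact loss."""
    p = F.softmax(teacher_logits, dim=-1)
    q = F.softmax(student_logits, dim=-1)
    mix = skew * p + (1.0 - skew) * q
    return (p * (p.clamp_min(1e-9).log()
                 - mix.clamp_min(1e-9).log())).sum(dim=-1).mean()
```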

MKD: a multi-task knowledge distillation approach for pretrained language models

L Liu, H Wang, J Lin, R Socher, C Xiong - arXiv preprint arXiv:1911.03588, 2019 - arxiv.org
Pretrained language models have led to significant performance gains in many NLP tasks.
However, the intensive computing resources to train such models remain an issue …
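The abstract fragment stops at the motivation; taking the title at face value, multi-task distillation typically sums per-task distillation losses over a shared student encoder with task-specific heads. The sketch below shows that overall shape under assumed interfaces, not the paper's exact training scheme.

```python
import torch
import torch.nn.functional as F

def multi_task_distillation_loss(shared_student, task_heads, teachers, batches,
                                 temperature=2.0):
    """Illustrative multi-task KD: one shared student encoder, one head and one
    teacher per task, losses averaged across tasks. Interfaces are assumptions."""
    total = 0.0
    for task, batch in batches.items():
        hidden = shared_student(batch)              # shared student encoding
        student_logits = task_heads[task](hidden)   # task-specific head
        with torch.no_grad():
            teacher_probs = F.softmax(teachers[task](batch) / temperature, dim=-1)
        student_logp = F.log_softmax(student_logits / temperature, dim=-1)
        total = total + F.kl_div(student_logp, teacher_probs,
                                 reduction="batchmean") * temperature ** 2
    return total / len(batches)
```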