MiniLLM: Knowledge distillation of large language models

Y Gu, L Dong, F Wei, M Huang - The Twelfth International …, 2024 - openreview.net
Knowledge Distillation (KD) is a promising technique for reducing the high computational
demand of large language models (LLMs). However, previous KD methods are primarily …
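
The snippet is cut off, but MiniLLM is known for replacing the usual forward KL in token-level distillation with a reverse KL(student || teacher), optimized with policy-gradient methods. As a rough illustration of just the reverse-KL term on teacher-forced tokens (not the full optimization procedure described in the paper), here is a minimal PyTorch sketch; all tensor names are illustrative:

```python
import torch
import torch.nn.functional as F

def reverse_kl_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Reverse KL(student || teacher), averaged over tokens.

    Shapes: (batch, seq_len, vocab). Simplified token-level term only;
    not the policy-gradient training loop described in the MiniLLM paper.
    """
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    t_log_probs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_probs = s_log_probs.exp()
    # KL(p_s || p_t) = sum_x p_s(x) * (log p_s(x) - log p_t(x))
    kl = (s_probs * (s_log_probs - t_log_probs)).sum(dim=-1)
    return kl.mean()
```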

Dynamic knowledge distillation for pre-trained language models

L Li, Y Lin, S Ren, P Li, J Zhou, X Sun - arXiv preprint arXiv:2109.11295, 2021 - arxiv.org
Knowledge distillation (KD) has been proved effective for compressing large-scale pre-
trained language models. However, existing methods conduct KD statically, e.g., the student …
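
The snippet stops mid-sentence; one generic way to make the distillation signal dynamic rather than static is to reweight the per-example distillation term by the student's current uncertainty. The sketch below only illustrates that idea; the weighting scheme, names, and temperature are assumptions, not necessarily the paper's exact method:

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Per-example KD loss weighted by student prediction entropy.

    Illustrative only: harder (higher-entropy) examples get more
    distillation weight. Shapes: (batch, num_classes).
    """
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Forward KL(teacher || student) per example, scaled by T^2 as usual.
    kd_per_example = F.kl_div(s_log_probs, t_probs, reduction="none").sum(-1) * temperature**2
    # Student entropy as a proxy for example difficulty.
    s_probs = s_log_probs.exp()
    entropy = -(s_probs * s_log_probs).sum(-1)
    weights = entropy / (entropy.sum() + 1e-8)
    return (weights * kd_per_example).sum()
```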

MixKD: Towards efficient distillation of large-scale language models

KJ Liang, W Hao, D Shen, Y Zhou, W Chen… - arXiv preprint arXiv …, 2020 - arxiv.org
Large-scale language models have recently demonstrated impressive empirical
performance. Nevertheless, the improved results are attained at the price of bigger models …
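
The snippet is pure motivation, but the title points at mixup-style data augmentation during distillation: interpolated inputs are fed to both teacher and student, and the student is trained to match the teacher on them. A minimal sketch of the interpolation step on token embeddings (the embedding-level mixing, shapes, and alpha are assumptions for illustration, not the paper's exact recipe):

```python
import torch

def mixup_token_embeddings(emb_a, emb_b, alpha=0.4):
    """Mixup-style interpolation of two batches of token embeddings.

    emb_a, emb_b: (batch, seq_len, hidden). The mixed batch would be passed
    through both teacher and student, with a distillation loss applied to
    the resulting logits.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample()
    mixed = lam * emb_a + (1.0 - lam) * emb_b
    return mixed, lam
```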

On-policy distillation of language models: Learning from self-generated mistakes

R Agarwal, N Vieillard, Y Zhou, P Stanczyk… - The Twelfth …, 2024 - openreview.net
Knowledge distillation (KD) is widely used for compressing a teacher model to reduce its
inference cost and memory footprint, by training a smaller student model. However, current …
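
The truncated "However, current …" presumably contrasts with distilling on a fixed dataset; the title's point is that the student learns on sequences it generates itself. A rough sketch of one on-policy step, assuming HuggingFace-style causal LMs with a .generate() method and logits-returning forward passes (all names, the forward-KL choice, and hyperparameters are assumptions; the paper's full method also considers other divergences and mixed data sources):

```python
import torch
import torch.nn.functional as F

def on_policy_kd_step(student, teacher, prompt_ids, temperature=1.0):
    """One illustrative on-policy distillation step.

    The student generates its own continuations ("self-generated" data) and
    is then trained to match the teacher's token distributions on them.
    """
    with torch.no_grad():
        sampled = student.generate(prompt_ids, do_sample=True, max_new_tokens=64)
        t_logits = teacher(sampled).logits
    s_logits = student(sampled).logits
    s_logp = F.log_softmax(s_logits / temperature, dim=-1)
    t_prob = F.softmax(t_logits / temperature, dim=-1)
    # Forward KL(teacher || student), evaluated on student-sampled tokens.
    return F.kl_div(s_logp, t_prob, reduction="batchmean") * temperature**2
```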

Reinforced multi-teacher selection for knowledge distillation

F Yuan, L Shou, J Pei, W Lin, M Gong, Y Fu… - Proceedings of the AAAI …, 2021 - ojs.aaai.org
In natural language processing (NLP) tasks, slow inference speed and huge footprints in
GPU usage remain the bottleneck of applying pre-trained deep models in production. As a …

Cost-effective distillation of large language models

S Dasgupta, T Cohn, T Baldwin - Findings of the Association for …, 2023 - aclanthology.org
Knowledge distillation (KD) involves training a small “student” model to replicate the
strong performance of a high-capacity “teacher” model, enabling efficient deployment in …
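
Since this snippet states the standard KD setup most plainly, a reference sketch of the classic soft-target loss (Hinton-style) may help ground the surrounding entries; the alpha/temperature values and names are illustrative, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Classic soft-target distillation for a classification student.

    Combines cross-entropy on gold labels with a KL term that pushes the
    student's softened distribution toward the teacher's.
    """
    ce = F.cross_entropy(student_logits, labels)
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(s_logp, t_prob, reduction="batchmean") * temperature**2
    return alpha * ce + (1.0 - alpha) * kd
```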

Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty

I Timiryasov, JL Tastet - arXiv preprint arXiv:2308.02019, 2023 - arxiv.org
We present our proposed solution to the BabyLM challenge [arXiv: 2301.11796], whose goal
was to improve the sample efficiency of language models. We trained an ensemble …
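
Distilling from an ensemble of teachers is often done by matching the student to an average of the teachers' softened distributions. A minimal sketch of that idea (equal teacher weighting and the names are assumptions; the paper's exact ensembling may differ):

```python
import torch
import torch.nn.functional as F

def ensemble_distillation_loss(student_logits, teacher_logits_list, temperature=2.0):
    """KL from the student to the average of several teachers' distributions.

    teacher_logits_list: list of (batch, num_classes) tensors, one per teacher.
    """
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    return F.kl_div(s_logp, t_probs, reduction="batchmean") * temperature**2
```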

Meta-KD: A meta knowledge distillation framework for language model compression across domains

H Pan, C Wang, M Qiu, Y Zhang, Y Li… - arXiv preprint arXiv …, 2020 - arxiv.org
Pre-trained language models have been applied to various NLP tasks with considerable
performance gains. However, the large model sizes, together with the long inference time …

One teacher is enough? Pre-trained language model distillation from multiple teachers

C Wu, F Wu, Y Huang - arXiv preprint arXiv:2106.01023, 2021 - arxiv.org
Pre-trained language models (PLMs) achieve great success in NLP. However, their huge
model sizes hinder their applications in many practical systems. Knowledge distillation is a …