DistiLLM: Towards streamlined distillation for large language models

J Ko, S Kim, T Chen, SY Yun - arXiv preprint arXiv:2402.03898, 2024 - arxiv.org
Knowledge distillation (KD) is widely used for compressing a teacher model to a smaller
student model, reducing its inference cost and memory footprint while preserving model …

Revisiting knowledge distillation for autoregressive language models

Q Zhong, L Ding, L Shen, J Liu, B Du, D Tao - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge distillation (KD) is a common approach to compress a teacher model to reduce
its inference cost and memory footprint, by training a smaller student model. However, in the …

On-policy distillation of language models: Learning from self-generated mistakes

R Agarwal, N Vieillard, Y Zhou, P Stanczyk… - The Twelfth …, 2024 - openreview.net
Knowledge distillation (KD) is widely used for compressing a teacher model to reduce its
inference cost and memory footprint, by training a smaller student model. However, current …

Exploring and enhancing the transfer of distribution in knowledge distillation for autoregressive language models

J Rao, X Liu, Z Lin, L Ding, J Li, D Tao… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge distillation (KD) is a technique that compresses large teacher models by training
smaller student models to mimic them. The success of KD in auto-regressive language …

Dynamic knowledge distillation for pre-trained language models

L Li, Y Lin, S Ren, P Li, J Zhou, X Sun - arXiv preprint arXiv:2109.11295, 2021 - arxiv.org
Knowledge distillation (KD) has been proved effective for compressing large-scale pre-
trained language models. However, existing methods conduct KD statically, e.g., the student …

GKD: Generalized knowledge distillation for auto-regressive sequence models

R Agarwal, N Vieillard, P Stanczyk, S Ramos… - arXiv preprint arXiv …, 2023 - arxiv.org
Knowledge distillation is commonly used for compressing neural networks to reduce their
inference cost and memory footprint. However, current distillation methods for auto …

MixKD: Towards efficient distillation of large-scale language models

KJ Liang, W Hao, D Shen, Y Zhou, W Chen… - arXiv preprint arXiv …, 2020 - arxiv.org
Large-scale language models have recently demonstrated impressive empirical
performance. Nevertheless, the improved results are attained at the price of bigger models …

Cost-effective distillation of large language models

S Dasgupta, T Cohn, T Baldwin - Findings of the Association for …, 2023 - aclanthology.org
Knowledge distillation (KD) involves training a small “student” model to replicate the
strong performance of a high-capacity “teacher” model, enabling efficient deployment in …

Knowledge distillation of large language models

Y Gu, L Dong, F Wei, M Huang - arXiv preprint arXiv:2306.08543, 2023 - arxiv.org
Knowledge Distillation (KD) is a promising technique for reducing the high computational
demand of large language models (LLMs). However, previous KD methods are primarily …

Autoregressive knowledge distillation through imitation learning

A Lin, J Wohlwend, H Chen, T Lei - arXiv preprint arXiv:2009.07253, 2020 - arxiv.org
The performance of autoregressive models on natural language generation tasks has
dramatically improved due to the adoption of deep, self-attentive architectures. However …
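
All of the entries above build on the same baseline objective: training a smaller student model to match the teacher's next-token distribution. For reference, below is a minimal PyTorch sketch of that standard token-level distillation loss (temperature-scaled forward KL, following Hinton et al., 2015). It illustrates the common starting point only, not the specific method of any paper listed here; the tensor shapes and function name are illustrative assumptions.

import torch
import torch.nn.functional as F

def token_level_kd_loss(student_logits: torch.Tensor,
                        teacher_logits: torch.Tensor,
                        temperature: float = 2.0) -> torch.Tensor:
    """Temperature-scaled forward KL between teacher and student next-token
    distributions, averaged over all token positions.

    Both logit tensors are assumed to have shape (batch, seq_len, vocab_size).
    """
    vocab_size = student_logits.size(-1)
    # Soften both distributions with the temperature, then flatten so each
    # row is one token position's distribution over the vocabulary.
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1).view(-1, vocab_size)
    t_probs = F.softmax(teacher_logits / temperature, dim=-1).view(-1, vocab_size)
    # KL(teacher || student); "batchmean" averages over token positions, and the
    # T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature ** 2

if __name__ == "__main__":
    # Tiny smoke test with random logits standing in for real model outputs.
    batch, seq_len, vocab_size = 2, 8, 100
    student_logits = torch.randn(batch, seq_len, vocab_size, requires_grad=True)
    with torch.no_grad():
        teacher_logits = torch.randn(batch, seq_len, vocab_size)
    loss = token_level_kd_loss(student_logits, teacher_logits)
    loss.backward()
    print(f"KD loss: {loss.item():.4f}")

The listed methods largely differ in what replaces this term, for example alternative divergences or distillation on student-generated (on-policy) sequences rather than a fixed corpus.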