DistiLLM: Towards streamlined distillation for large language models

J Ko, S Kim, T Chen, SY Yun - arXiv preprint arXiv:2402.03898, 2024 - arxiv.org
Knowledge distillation (KD) is widely used for compressing a teacher model to a smaller
student model, reducing its inference cost and memory footprint while preserving model …

Revisiting knowledge distillation for autoregressive language models

Q Zhong, L Ding, L Shen, J Liu, B Du, D Tao - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge distillation (KD) is a common approach to compress a teacher model to reduce
its inference cost and memory footprint, by training a smaller student model. However, in the …

On-policy distillation of language models: Learning from self-generated mistakes

R Agarwal, N Vieillard, Y Zhou, P Stanczyk… - The Twelfth …, 2024 - openreview.net
Knowledge distillation (KD) is widely used for compressing a teacher model to reduce its
inference cost and memory footprint, by training a smaller student model. However, current …

Exploring and enhancing the transfer of distribution in knowledge distillation for autoregressive language models

J Rao, X Liu, Z Lin, L Ding, J Li, D Tao… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge distillation (KD) is a technique that compresses large teacher models by training
smaller student models to mimic them. The success of KD in auto-regressive language …

Dynamic knowledge distillation for pre-trained language models

L Li, Y Lin, S Ren, P Li, J Zhou, X Sun - arXiv preprint arXiv:2109.11295, 2021 - arxiv.org
Knowledge distillation (KD) has been proved effective for compressing large-scale pre-
trained language models. However, existing methods conduct KD statically, e.g., the student …

GKD: Generalized knowledge distillation for auto-regressive sequence models

R Agarwal, N Vieillard, P Stanczyk, S Ramos… - arXiv preprint arXiv …, 2023 - arxiv.org
Knowledge distillation is commonly used for compressing neural networks to reduce their
inference cost and memory footprint. However, current distillation methods for auto …

MixKD: Towards efficient distillation of large-scale language models

KJ Liang, W Hao, D Shen, Y Zhou, W Chen… - arXiv preprint arXiv …, 2020 - arxiv.org
Large-scale language models have recently demonstrated impressive empirical
performance. Nevertheless, the improved results are attained at the price of bigger models …

Cost-effective distillation of large language models

S Dasgupta, T Cohn, T Baldwin - Findings of the Association for …, 2023 - aclanthology.org
Knowledge distillation (KD) involves training a small “student” model to replicate the
strong performance of a high-capacity “teacher” model, enabling efficient deployment in …

Knowledge distillation of large language models

Y Gu, L Dong, F Wei, M Huang - arXiv preprint arXiv:2306.08543, 2023 - arxiv.org
Knowledge Distillation (KD) is a promising technique for reducing the high computational
demand of large language models (LLMs). However, previous KD methods are primarily …

Autoregressive knowledge distillation through imitation learning

A Lin, J Wohlwend, H Chen, T Lei - arXiv preprint arXiv:2009.07253, 2020 - arxiv.org
The performance of autoregressive models on natural language generation tasks has
dramatically improved due to the adoption of deep, self-attentive architectures. However …
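
All of the entries above build on the same baseline objective: training a smaller student model to match the teacher's next-token distribution. For reference, below is a minimal PyTorch sketch of that standard token-level distillation loss (temperature-scaled forward KL, following Hinton et al., 2015). It illustrates the common starting point only, not the specific method of any paper listed here; the tensor shapes and function name are illustrative assumptions.

import torch
import torch.nn.functional as F

def token_level_kd_loss(student_logits: torch.Tensor,
                        teacher_logits: torch.Tensor,
                        temperature: float = 2.0) -> torch.Tensor:
    """Temperature-scaled forward KL between teacher and student next-token
    distributions, averaged over all token positions.

    Both logit tensors are assumed to have shape (batch, seq_len, vocab_size).
    """
    vocab_size = student_logits.size(-1)
    # Soften both distributions with the temperature, then flatten so each
    # row is one token position's distribution over the vocabulary.
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1).view(-1, vocab_size)
    t_probs = F.softmax(teacher_logits / temperature, dim=-1).view(-1, vocab_size)
    # KL(teacher || student); "batchmean" averages over token positions, and the
    # T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature ** 2

if __name__ == "__main__":
    # Tiny smoke test with random logits standing in for real model outputs.
    batch, seq_len, vocab_size = 2, 8, 100
    student_logits = torch.randn(batch, seq_len, vocab_size, requires_grad=True)
    with torch.no_grad():
        teacher_logits = torch.randn(batch, seq_len, vocab_size)
    loss = token_level_kd_loss(student_logits, teacher_logits)
    loss.backward()
    print(f"KD loss: {loss.item():.4f}")

The listed methods largely differ in what replaces this term, for example alternative divergences or distillation on student-generated (on-policy) sequences rather than a fixed corpus.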