MiniLLM: Knowledge distillation of large language models

Y Gu, L Dong, F Wei, M Huang - The Twelfth International …, 2024 - openreview.net
Knowledge Distillation (KD) is a promising technique for reducing the high computational
demand of large language models (LLMs). However, previous KD methods are primarily …
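
The snippet is cut off, but MiniLLM is known for replacing the usual forward KL in token-level distillation with a reverse KL(student || teacher), optimized with policy-gradient methods. As a rough illustration of just the reverse-KL term on teacher-forced tokens (not the full optimization procedure described in the paper), here is a minimal PyTorch sketch; all tensor names are illustrative:

```python
import torch
import torch.nn.functional as F

def reverse_kl_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Reverse KL(student || teacher), averaged over tokens.

    Shapes: (batch, seq_len, vocab). Simplified token-level term only;
    not the policy-gradient training loop described in the MiniLLM paper.
    """
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    t_log_probs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_probs = s_log_probs.exp()
    # KL(p_s || p_t) = sum_x p_s(x) * (log p_s(x) - log p_t(x))
    kl = (s_probs * (s_log_probs - t_log_probs)).sum(dim=-1)
    return kl.mean()
```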

Dynamic knowledge distillation for pre-trained language models

L Li, Y Lin, S Ren, P Li, J Zhou, X Sun - arXiv preprint arXiv:2109.11295, 2021 - arxiv.org
Knowledge distillation (KD) has been proved effective for compressing large-scale pre-
trained language models. However, existing methods conduct KD statically, e.g., the student …
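
The snippet stops mid-sentence; one generic way to make the distillation signal dynamic rather than static is to reweight the per-example distillation term by the student's current uncertainty. The sketch below only illustrates that idea; the weighting scheme, names, and temperature are assumptions, not necessarily the paper's exact method:

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Per-example KD loss weighted by student prediction entropy.

    Illustrative only: harder (higher-entropy) examples get more
    distillation weight. Shapes: (batch, num_classes).
    """
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Forward KL(teacher || student) per example, scaled by T^2 as usual.
    kd_per_example = F.kl_div(s_log_probs, t_probs, reduction="none").sum(-1) * temperature**2
    # Student entropy as a proxy for example difficulty.
    s_probs = s_log_probs.exp()
    entropy = -(s_probs * s_log_probs).sum(-1)
    weights = entropy / (entropy.sum() + 1e-8)
    return (weights * kd_per_example).sum()
```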

MixKD: Towards efficient distillation of large-scale language models

KJ Liang, W Hao, D Shen, Y Zhou, W Chen… - arXiv preprint arXiv …, 2020 - arxiv.org
Large-scale language models have recently demonstrated impressive empirical
performance. Nevertheless, the improved results are attained at the price of bigger models …
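
The snippet is pure motivation, but the title points at mixup-style data augmentation during distillation: interpolated inputs are fed to both teacher and student, and the student is trained to match the teacher on them. A minimal sketch of the interpolation step on token embeddings (the embedding-level mixing, shapes, and alpha are assumptions for illustration, not the paper's exact recipe):

```python
import torch

def mixup_token_embeddings(emb_a, emb_b, alpha=0.4):
    """Mixup-style interpolation of two batches of token embeddings.

    emb_a, emb_b: (batch, seq_len, hidden). The mixed batch would be passed
    through both teacher and student, with a distillation loss applied to
    the resulting logits.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample()
    mixed = lam * emb_a + (1.0 - lam) * emb_b
    return mixed, lam
```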

On-policy distillation of language models: Learning from self-generated mistakes

R Agarwal, N Vieillard, Y Zhou, P Stanczyk… - The Twelfth …, 2024 - openreview.net
Knowledge distillation (KD) is widely used for compressing a teacher model to reduce its
inference cost and memory footprint, by training a smaller student model. However, current …
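
The truncated "However, current …" presumably contrasts with distilling on a fixed dataset; the title's point is that the student learns on sequences it generates itself. A rough sketch of one on-policy step, assuming HuggingFace-style causal LMs with a .generate() method and logits-returning forward passes (all names, the forward-KL choice, and hyperparameters are assumptions; the paper's full method also considers other divergences and mixed data sources):

```python
import torch
import torch.nn.functional as F

def on_policy_kd_step(student, teacher, prompt_ids, temperature=1.0):
    """One illustrative on-policy distillation step.

    The student generates its own continuations ("self-generated" data) and
    is then trained to match the teacher's token distributions on them.
    """
    with torch.no_grad():
        sampled = student.generate(prompt_ids, do_sample=True, max_new_tokens=64)
        t_logits = teacher(sampled).logits
    s_logits = student(sampled).logits
    s_logp = F.log_softmax(s_logits / temperature, dim=-1)
    t_prob = F.softmax(t_logits / temperature, dim=-1)
    # Forward KL(teacher || student), evaluated on student-sampled tokens.
    return F.kl_div(s_logp, t_prob, reduction="batchmean") * temperature**2
```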

Reinforced multi-teacher selection for knowledge distillation

F Yuan, L Shou, J Pei, W Lin, M Gong, Y Fu… - Proceedings of the AAAI …, 2021 - ojs.aaai.org
In natural language processing (NLP) tasks, slow inference speed and huge footprints in
GPU usage remain the bottleneck of applying pre-trained deep models in production. As a …

Cost-effective distillation of large language models

S Dasgupta, T Cohn, T Baldwin - Findings of the Association for …, 2023 - aclanthology.org
Knowledge distillation (KD) involves training a small “student” model to replicate the
strong performance of a high-capacity “teacher” model, enabling efficient deployment in …
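
Since this snippet states the standard KD setup most plainly, a reference sketch of the classic soft-target loss (Hinton-style) may help ground the surrounding entries; the alpha/temperature values and names are illustrative, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Classic soft-target distillation for a classification student.

    Combines cross-entropy on gold labels with a KL term that pushes the
    student's softened distribution toward the teacher's.
    """
    ce = F.cross_entropy(student_logits, labels)
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(s_logp, t_prob, reduction="batchmean") * temperature**2
    return alpha * ce + (1.0 - alpha) * kd
```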

Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty

I Timiryasov, JL Tastet - arXiv preprint arXiv:2308.02019, 2023 - arxiv.org
We present our proposed solution to the BabyLM challenge [arXiv: 2301.11796], whose goal
was to improve the sample efficiency of language models. We trained an ensemble …
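
Distilling from an ensemble of teachers is often done by matching the student to an average of the teachers' softened distributions. A minimal sketch of that idea (equal teacher weighting and the names are assumptions; the paper's exact ensembling may differ):

```python
import torch
import torch.nn.functional as F

def ensemble_distillation_loss(student_logits, teacher_logits_list, temperature=2.0):
    """KL from the student to the average of several teachers' distributions.

    teacher_logits_list: list of (batch, num_classes) tensors, one per teacher.
    """
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    return F.kl_div(s_logp, t_probs, reduction="batchmean") * temperature**2
```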

Meta-KD: A meta knowledge distillation framework for language model compression across domains

H Pan, C Wang, M Qiu, Y Zhang, Y Li… - arXiv preprint arXiv …, 2020 - arxiv.org
Pre-trained language models have been applied to various NLP tasks with considerable
performance gains. However, the large model sizes, together with the long inference time …

One teacher is enough? Pre-trained language model distillation from multiple teachers

C Wu, F Wu, Y Huang - arXiv preprint arXiv:2106.01023, 2021 - arxiv.org
Pre-trained language models (PLMs) achieve great success in NLP. However, their huge
model sizes hinder their applications in many practical systems. Knowledge distillation is a …