Cost-effective distillation of large language models

S Dasgupta, T Cohn, T Baldwin - Findings of the Association for …, 2023 - aclanthology.org
Knowledge distillation (KD) involves training a small “student” model to replicate the
strong performance of a high-capacity “teacher” model, enabling efficient deployment in …
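All of the entries below build on the same basic recipe, so a minimal sketch of vanilla response-based distillation (a temperature-scaled soft-target term plus the usual hard-label cross-entropy) may be a useful point of reference. The temperature and weighting values are illustrative defaults, not taken from any cited paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Vanilla response-based KD: soft-target KL term + hard-label CE term.

    student_logits, teacher_logits: (batch, num_classes)
    labels: (batch,) gold class indices
    temperature, alpha: illustrative hyperparameters, not from any cited paper.
    """
    # Soften both distributions with the same temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL(teacher || student); the T^2 factor keeps gradients comparable across temperatures.
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Standard supervised loss on the gold labels.
    ce_term = F.cross_entropy(student_logits, labels)

    return alpha * kd_term + (1.0 - alpha) * ce_term
```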

ReAugKD: Retrieval-augmented knowledge distillation for pre-trained language models

J Zhang, A Muhamed, A Anantharaman… - Proceedings of the …, 2023 - aclanthology.org
Knowledge Distillation (KD) is one of the most effective approaches to deploying
large-scale pre-trained language models in low-latency environments by transferring the …
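The snippet is cut off before the method itself; going only by the title, a retrieval-augmented variant of distillation could look roughly like the sketch below, where teacher embeddings and soft labels are stored in a nearest-neighbour index and aggregated for a new student embedding. The class and its interface are hypothetical, not the authors' implementation.

```python
import numpy as np

class SoftLabelKnowledgeBase:
    """Hypothetical retrieval component for retrieval-augmented distillation:
    stores teacher embeddings with their soft labels and returns a
    neighbour-weighted soft label for a new student embedding."""

    def __init__(self, teacher_embeddings, teacher_soft_labels):
        # (N, d) teacher embeddings and (N, C) teacher output distributions.
        self.keys = teacher_embeddings / np.linalg.norm(
            teacher_embeddings, axis=1, keepdims=True)
        self.values = teacher_soft_labels

    def retrieve(self, query_embedding, k=8):
        q = query_embedding / np.linalg.norm(query_embedding)
        sims = self.keys @ q                 # cosine similarity to all stored keys
        top = np.argsort(-sims)[:k]          # k nearest teacher examples
        weights = np.exp(sims[top])          # similarity-weighted average
        weights /= weights.sum()
        return weights @ self.values[top]    # aggregated soft label, shape (C,)
```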

DDK: Distilling domain knowledge for efficient large language models

J Liu, C Zhang, J Guo, Y Zhang, H Que, K Deng… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the advanced intelligence abilities of large language models (LLMs) in various
applications, they still face significant computational and storage demands. Knowledge …
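The abstract is truncated before the method; reading the title as domain-aware distillation, one plausible ingredient is re-weighting how often each domain's data is sampled based on how far the student lags the teacher there. The function below is an assumed illustration of that idea, not the DDK algorithm.

```python
import numpy as np

def domain_sampling_weights(student_loss_by_domain, teacher_loss_by_domain,
                            temperature=1.0):
    """Illustrative domain re-weighting: domains where the student trails the
    teacher most get sampled more often for distillation. Not the DDK recipe."""
    gaps = np.maximum(np.array(student_loss_by_domain)
                      - np.array(teacher_loss_by_domain), 0.0)
    logits = gaps / temperature
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()

# Example: three domains where the student lags the teacher by different margins.
print(domain_sampling_weights([2.1, 1.4, 3.0], [1.8, 1.3, 2.0]))
```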

MixKD: Towards efficient distillation of large-scale language models

KJ Liang, W Hao, D Shen, Y Zhou, W Chen… - arXiv preprint arXiv …, 2020 - arxiv.org
Large-scale language models have recently demonstrated impressive empirical
performance. Nevertheless, the improved results are attained at the price of bigger models …
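The snippet stops at the motivation, but the title points to mixup-style augmentation during distillation; a hedged sketch of that idea (interpolating pairs of input embeddings and matching the teacher on the mixed inputs) follows. The function names and embedding-level interpolation are assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def mixup_distillation_step(student, teacher, input_embeds, alpha=0.4):
    """Illustrative mixup-for-KD step: interpolate pairs of input embeddings
    and train the student to match the teacher on the mixed inputs.
    `student` and `teacher` are assumed to map embeddings -> logits."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(input_embeds.size(0))
    mixed = lam * input_embeds + (1.0 - lam) * input_embeds[perm]

    with torch.no_grad():
        teacher_probs = F.softmax(teacher(mixed), dim=-1)
    student_logp = F.log_softmax(student(mixed), dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean")
```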

HomoDistil: Homotopic task-agnostic distillation of pre-trained transformers

C Liang, H Jiang, Z Li, X Tang, B Yin, T Zhao - arXiv preprint arXiv …, 2023 - arxiv.org
Knowledge distillation has been shown to be a powerful model compression approach to
facilitate the deployment of pre-trained language models in practice. This paper focuses on …
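Again the snippet is cut short; the title suggests a student that starts as a copy of the teacher and is shrunk gradually while being distilled. The pruning routine below is a placeholder sketch of the shrinking step under that assumption; the criterion and schedule are not HomoDistil's.

```python
import torch

def gradually_prune(student, target_sparsity, step, total_steps):
    """Placeholder magnitude pruning that zeroes a growing fraction of the
    smallest weights as training progresses (not HomoDistil's criterion)."""
    current = target_sparsity * min(step / total_steps, 1.0)
    for p in student.parameters():
        if p.dim() < 2:          # skip biases and layer-norm parameters
            continue
        k = int(current * p.numel())
        if k == 0:
            continue
        threshold = p.abs().flatten().kthvalue(k).values
        p.data[p.abs() <= threshold] = 0.0
```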

f-Divergence Minimization for Sequence-Level Knowledge Distillation

Y Wen, Z Li, W Du, L Mou - arXiv preprint arXiv:2307.15190, 2023 - arxiv.org
Knowledge distillation (KD) is the process of transferring knowledge from a large model to a
small one. It has gained increasing attention in the natural language processing community …
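Since the snippet names the objective family, it may help to recall the definition it builds on: for a convex f with f(1) = 0, the f-divergence between a teacher distribution p and a student distribution q_theta is given below, with forward and reverse KL as standard special cases. This is the textbook definition, not a claim about the paper's exact sequence-level estimator.

```latex
D_f\!\left(p \,\Vert\, q_\theta\right)
  = \sum_{y} q_\theta(y)\, f\!\left(\frac{p(y)}{q_\theta(y)}\right),
  \qquad f \text{ convex},\ f(1) = 0,
\qquad
f(t) = t \log t \;\Rightarrow\; \mathrm{KL}\!\left(p \,\Vert\, q_\theta\right),
\qquad
f(t) = -\log t \;\Rightarrow\; \mathrm{KL}\!\left(q_\theta \,\Vert\, p\right).
```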

RAIL-KD: RAndom intermediate layer mapping for knowledge distillation

MA Haidar, N Anchuri, M Rezagholizadeh… - arXiv preprint arXiv …, 2021 - arxiv.org
Intermediate layer knowledge distillation (KD) can improve the standard KD technique
(which only targets the output of teacher and student models) especially over large pre …
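Because the snippet describes intermediate-layer KD directly, a brief sketch may help: a random subset of teacher layers is selected and their hidden states are matched to the student's layers, e.g. with an MSE term. The projection and selection details below are assumptions, not the paper's exact configuration.

```python
import random
import torch
import torch.nn.functional as F

def random_layer_mapping_loss(student_hiddens, teacher_hiddens, projections):
    """Illustrative random intermediate-layer KD: pick as many teacher layers
    as the student has, at random, and match them in order with an MSE term.
    `projections` align hidden sizes; interfaces here are assumptions."""
    n_student = len(student_hiddens)
    chosen = sorted(random.sample(range(len(teacher_hiddens)), n_student))
    loss = 0.0
    for proj, s_h, t_idx in zip(projections, student_hiddens, chosen):
        loss = loss + F.mse_loss(proj(s_h), teacher_hiddens[t_idx])
    return loss / n_student
```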

Adversarial data augmentation for task-specific knowledge distillation of pre-trained transformers

M Zhang, NU Naresh, Y He - Proceedings of the AAAI Conference on …, 2022 - ojs.aaai.org
Deep and large pre-trained language models (e.g., BERT, GPT-3) are state-of-the-art for
various natural language processing tasks. However, the huge size of these models brings …
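The snippet ends before the method; one common reading of "adversarial data augmentation" is generating perturbed inputs that maximise teacher-student disagreement and distilling on them. The FGSM-style sketch below illustrates that general idea under assumed interfaces, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def adversarial_augment(student, teacher, input_embeds, epsilon=1e-2):
    """Illustrative adversarial augmentation for KD: nudge input embeddings in
    the direction that maximises teacher-student disagreement (FGSM-style),
    producing harder examples to distill on. Not the paper's exact method."""
    embeds = input_embeds.clone().detach().requires_grad_(True)
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(embeds), dim=-1)
    student_logp = F.log_softmax(student(embeds), dim=-1)
    disagreement = F.kl_div(student_logp, teacher_probs, reduction="batchmean")
    disagreement.backward()
    return (embeds + epsilon * embeds.grad.sign()).detach()
```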

DistiLLM: Towards streamlined distillation for large language models

J Ko, S Kim, T Chen, SY Yun - arXiv preprint arXiv:2402.03898, 2024 - arxiv.org
Knowledge distillation (KD) is widely used for compressing a teacher model to a smaller
student model, reducing its inference cost and memory footprint while preserving model …
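The snippet stops before the paper's contributions; one refinement that appears in recent LLM distillation objectives is a skewed KL divergence that mixes the teacher and student distributions before comparing them, which keeps the loss bounded when the student assigns near-zero mass. The sketch below illustrates that general idea, not necessarily DistiLLM's exact formulation.

```python
import torch
import torch.nn.functional as F

def skew_kl(teacher_logits, student_logits, skew=0.1):
    """Skewed KL: compare the teacher distribution against a mixture of
    teacher and student distributions. Illustrative, with `skew` as an
    assumed hyperparameter; not a statement of DistiLLM's exact loss."""
    p = F.softmax(teacher_logits, dim=-1)
    q = F.softmax(student_logits, dim=-1)
    mix = skew * p + (1.0 - skew) * q
    return (p * (p.clamp_min(1e-9).log()
                 - mix.clamp_min(1e-9).log())).sum(dim=-1).mean()
```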

MKD: a multi-task knowledge distillation approach for pretrained language models

L Liu, H Wang, J Lin, R Socher, C Xiong - arXiv preprint arXiv:1911.03588, 2019 - arxiv.org
Pretrained language models have led to significant performance gains in many NLP tasks.
However, the intensive computing resources to train such models remain an issue …
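The abstract fragment stops at the motivation; taking the title at face value, multi-task distillation typically sums per-task distillation losses over a shared student encoder with task-specific heads. The sketch below shows that overall shape under assumed interfaces, not the paper's exact training scheme.

```python
import torch
import torch.nn.functional as F

def multi_task_distillation_loss(shared_student, task_heads, teachers, batches,
                                 temperature=2.0):
    """Illustrative multi-task KD: one shared student encoder, one head and one
    teacher per task, losses averaged across tasks. Interfaces are assumptions."""
    total = 0.0
    for task, batch in batches.items():
        hidden = shared_student(batch)              # shared student encoding
        student_logits = task_heads[task](hidden)   # task-specific head
        with torch.no_grad():
            teacher_probs = F.softmax(teachers[task](batch) / temperature, dim=-1)
        student_logp = F.log_softmax(student_logits / temperature, dim=-1)
        total = total + F.kl_div(student_logp, teacher_probs,
                                 reduction="batchmean") * temperature ** 2
    return total / len(batches)
```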