RAIL-KD: RAndom intermediate layer mapping for knowledge distillation

MA Haidar, N Anchuri, M Rezagholizadeh… - arXiv preprint arXiv …, 2021 - arxiv.org
Intermediate layer knowledge distillation (KD) can improve the standard KD technique
(which only targets the output of teacher and student models) especially over large pre …
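
Going only by the title, the idea is to pair student layers with randomly chosen teacher layers instead of a fixed layer-to-layer correspondence. The PyTorch sketch below contrasts the standard output-only KD loss with such a random intermediate-layer term; the function names, the MSE alignment loss, and the assumption of equal hidden sizes are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: output-only KD vs. intermediate-layer KD with a random
# teacher-layer mapping, re-sampled at every training step. Equal hidden sizes
# are assumed for simplicity; otherwise a learned projection would be needed.
import random
import torch
import torch.nn.functional as F

def output_kd_loss(student_logits, teacher_logits, temperature=2.0):
    # Standard KD: match temperature-softened output distributions.
    t = temperature
    return t * t * F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    )

def random_intermediate_kd_loss(student_hiddens, teacher_hiddens):
    # Align each student hidden state with a randomly drawn teacher layer.
    loss = 0.0
    for s_h in student_hiddens:
        t_h = random.choice(teacher_hiddens)
        loss = loss + F.mse_loss(s_h, t_h.detach())
    return loss / len(student_hiddens)
```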

AdaDS: Adaptive data selection for accelerating pre-trained language model knowledge distillation

Q Zhou, P Li, Y Liu, Y Guan, Q Xing, M Chen, M Sun - AI Open, 2023 - Elsevier
Knowledge distillation (KD) is a widely used method for transferring knowledge
from large teacher models to computationally efficient student models. Unfortunately, the …

Improved Knowledge Distillation for Pre-trained Language Models via Knowledge Selection

C Wang, Y Lu, Y Mu, Y Hu, T Xiao, J Zhu - arXiv preprint arXiv:2302.00444, 2023 - arxiv.org
Knowledge distillation addresses the problem of transferring knowledge from a teacher
model to a student model. In this process, we typically have multiple types of knowledge …

ReAugKD: Retrieval-augmented knowledge distillation for pre-trained language models

J Zhang, A Muhamed, A Anantharaman… - Proceedings of the …, 2023 - aclanthology.org
Knowledge Distillation (KD) is one of the most effective approaches to deploying
large-scale pre-trained language models in low-latency environments by transferring the …

Gradient knowledge distillation for pre-trained language models

L Wang, L Li, X Sun - arXiv preprint arXiv:2211.01071, 2022 - arxiv.org
Knowledge distillation (KD) is an effective framework to transfer knowledge from a large-
scale teacher to a compact yet well-performing student. Previous KD practices for pre …
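
Taking the title at face value, the distilled signal here involves gradients rather than only outputs or hidden states. A speculative sketch of one such term, aligning the teacher's and student's loss gradients with respect to their own input embeddings via cosine similarity, is given below; the specific alignment term and the equal-dimensionality assumption are mine, not the paper's.

```python
# Speculative sketch of gradient-based KD: push the student's loss gradient
# w.r.t. its input embeddings to point in the same direction as the teacher's.
# Both embedding tensors are assumed to share the same dimensionality and to be
# kept in their respective models' autograd graphs.
import torch
import torch.nn.functional as F

def gradient_alignment_loss(student_loss, student_emb, teacher_loss, teacher_emb):
    g_s = torch.autograd.grad(student_loss, student_emb, create_graph=True)[0]
    g_t = torch.autograd.grad(teacher_loss, teacher_emb)[0].detach()
    # 1 - cosine similarity between flattened per-example gradients.
    return 1.0 - F.cosine_similarity(g_s.flatten(1), g_t.flatten(1), dim=-1).mean()
```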

Cost-effective distillation of large language models

S Dasgupta, T Cohn, T Baldwin - Findings of the Association for …, 2023 - aclanthology.org
Knowledge distillation (KD) involves training a small “student” model to replicate the
strong performance of a high-capacity “teacher” model, enabling efficient deployment in …

f-Divergence Minimization for Sequence-Level Knowledge Distillation

Y Wen, Z Li, W Du, L Mou - arXiv preprint arXiv:2307.15190, 2023 - arxiv.org
Knowledge distillation (KD) is the process of transferring knowledge from a large model to a
small one. It has gained increasing attention in the natural language processing community …
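
Forward and reverse KL are the two most familiar members of the f-divergence family, and they already behave quite differently as distillation objectives (mean-seeking vs. mode-seeking). The sketch below implements both at the token-distribution level purely as illustration; it is not claimed to match the paper's sequence-level objectives.

```python
# Illustrative only: forward and reverse KL as two members of the f-divergence
# family that could be minimized between teacher and student token distributions.
import torch
import torch.nn.functional as F

def forward_kl(student_logits, teacher_logits):
    # KL(teacher || student): mean-seeking, the student must cover all teacher modes.
    s_log = F.log_softmax(student_logits, dim=-1)
    t_log = F.log_softmax(teacher_logits, dim=-1).detach()
    return (t_log.exp() * (t_log - s_log)).sum(-1).mean()

def reverse_kl(student_logits, teacher_logits):
    # KL(student || teacher): mode-seeking, the student concentrates on
    # the teacher's high-probability tokens.
    s_log = F.log_softmax(student_logits, dim=-1)
    t_log = F.log_softmax(teacher_logits, dim=-1).detach()
    return (s_log.exp() * (s_log - t_log)).sum(-1).mean()
```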

DDK: Distilling domain knowledge for efficient large language models

J Liu, C Zhang, J Guo, Y Zhang, H Que, K Deng… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the advanced capabilities of large language models (LLMs) in various
applications, they still face significant computational and storage demands. Knowledge …

MiniPLM: Knowledge Distillation for Pre-Training Language Models

Y Gu, H Zhou, F Meng, J Zhou, M Huang - arXiv preprint arXiv:2410.17215, 2024 - arxiv.org
Knowledge distillation (KD) is widely used to train small, high-performing student language
models (LMs) using large teacher LMs. While effective in fine-tuning, KD during pre-training …

Dynamic knowledge distillation for pre-trained language models

L Li, Y Lin, S Ren, P Li, J Zhou, X Sun - arXiv preprint arXiv:2109.11295, 2021 - arxiv.org
Knowledge distillation (KD) has been proven effective for compressing large-scale pre-
trained language models. However, existing methods conduct KD statically, e.g., the student …
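
One plausible (hypothetical) way to make the objective dynamic rather than static is to re-weight each example's distillation term on the fly, for instance by the student's predictive uncertainty; the sketch below illustrates that idea and should not be read as the paper's actual scheme.

```python
# Hypothetical sketch of a "dynamic" KD objective: per-example distillation terms
# are re-weighted at every step by the student's predictive entropy instead of
# being averaged with a fixed, static weight. The weighting rule is an assumption.
import torch
import torch.nn.functional as F

def dynamic_kd_loss(student_logits, teacher_logits, temperature=2.0):
    t = temperature
    s_log = F.log_softmax(student_logits / t, dim=-1)
    t_prob = F.softmax(teacher_logits / t, dim=-1).detach()
    per_example_kd = (t_prob * (t_prob.clamp_min(1e-8).log() - s_log)).sum(-1)
    entropy = -(s_log.exp() * s_log).sum(-1)        # student uncertainty, per example
    weights = (entropy / entropy.sum()).detach()    # normalized, not differentiated
    return t * t * (weights * per_example_kd).sum()
```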