MiniPLM: Knowledge Distillation for Pre-Training Language Models

Y Gu, H Zhou, F Meng, J Zhou, M Huang - arXiv preprint arXiv:2410.17215, 2024 - arxiv.org
Knowledge distillation (KD) is widely used to train small, high-performing student language
models (LMs) using large teacher LMs. While effective in fine-tuning, KD during pre-training …

Knowledge distillation of large language models

Y Gu, L Dong, F Wei, M Huang - arXiv preprint arXiv:2306.08543, 2023 - arxiv.org
Knowledge Distillation (KD) is a promising technique for reducing the high computational
demand of large language models (LLMs). However, previous KD methods are primarily …

DDK: Distilling domain knowledge for efficient large language models

J Liu, C Zhang, J Guo, Y Zhang, H Que, K Deng… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the advanced capabilities of large language models (LLMs) in various
applications, they still face significant computational and storage demands. Knowledge …

PromptKD: Distilling Student-Friendly Knowledge for Generative Language Models via Prompt Tuning

G Kim, D Jang, E Yang - arXiv preprint arXiv:2402.12842, 2024 - arxiv.org
Recent advancements in large language models (LLMs) have raised concerns about
inference costs, increasing the need for research into model compression. While knowledge …

SWITCH: Studying with Teacher for Knowledge Distillation of Large Language Models

J Koo, Y Hwang, Y Kim, T Kang, H Bae… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the success of Large Language Models (LLMs), they still face challenges related to
high inference costs and memory requirements. To address these issues, Knowledge …

MiniLLM: Knowledge distillation of large language models

Y Gu, L Dong, F Wei, M Huang - The Twelfth International …, 2024 - openreview.net
Knowledge Distillation (KD) is a promising technique for reducing the high computational
demand of large language models (LLMs). However, previous KD methods are primarily …

Dynamic knowledge distillation for pre-trained language models

L Li, Y Lin, S Ren, P Li, J Zhou, X Sun - arXiv preprint arXiv:2109.11295, 2021 - arxiv.org
Knowledge distillation (KD) has proven effective for compressing large-scale pre-
trained language models. However, existing methods conduct KD statically, e.g., the student …

Revisiting knowledge distillation for autoregressive language models

Q Zhong, L Ding, L Shen, J Liu, B Du, D Tao - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge distillation (KD) is a common approach to compress a teacher model to reduce
its inference cost and memory footprint, by training a smaller student model. However, in the …

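As several of the entries above note, the core of white-box KD for language models is matching the student's next-token distribution to the teacher's. Below is a minimal sketch of such a token-level distillation loss, assuming PyTorch; the function name, shapes, and the random tensors standing in for real teacher/student logits are illustrative, not taken from any of the cited papers.

import torch
import torch.nn.functional as F

def token_kd_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both vocabulary distributions with the same temperature, then
    # compute KL(p_teacher || q_student) at each position.
    # Both logits tensors have shape (batch, seq_len, vocab_size).
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    kl_per_token = F.kl_div(student_log_probs, teacher_probs, reduction="none").sum(-1)
    # Average over all positions; the t**2 factor keeps the gradient scale
    # comparable across temperatures (Hinton et al., 2015).
    return kl_per_token.mean() * (t ** 2)

# Illustrative usage with random logits standing in for real model outputs.
batch, seq_len, vocab = 2, 16, 32000
loss = token_kd_loss(torch.randn(batch, seq_len, vocab),
                     torch.randn(batch, seq_len, vocab))

The papers above differ mainly in how (and in which direction) this divergence is measured and when it is applied, e.g. during fine-tuning versus pre-training.
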
AdaDS: Adaptive data selection for accelerating pre-trained language model knowledge distillation

Q Zhou, P Li, Y Liu, Y Guan, Q Xing, M Chen, M Sun - AI Open, 2023 - Elsevier
Knowledge distillation (KD) is a widely used method for transferring knowledge
from large teacher models to computationally efficient student models. Unfortunately, the …

Pre-training Distillation for Large Language Models: A Design Space Exploration

H Peng, X Lv, Y Bai, Z Yao, J Zhang, L Hou… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge distillation (KD) aims to transfer knowledge from a large teacher model to a
smaller student model. Previous work applying KD in the field of large language models …