Knowledge Distillation (KD) is one of the most effective approaches to deploying large-scale pre-trained language models in low-latency environments by transferring the …
Despite the advanced capabilities of large language models (LLMs) across various applications, they still face significant computational and storage demands. Knowledge …
Large-scale language models have recently demonstrated impressive empirical performance. Nevertheless, these improved results come at the cost of larger models …
Knowledge distillation has been shown to be a powerful model compression approach to facilitate the deployment of pre-trained language models in practice. This paper focuses on …
Y. Wen, Z. Li, W. Du, L. Mou. arXiv preprint arXiv:2307.15190, 2023.
Knowledge distillation (KD) is the process of transferring knowledge from a large model to a small one. It has gained increasing attention in the natural language processing community …
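To make this standard setup concrete, the sketch below shows a typical logit-level distillation loss in PyTorch: the student is trained on a weighted sum of hard-label cross-entropy and a temperature-softened KL term against the teacher's outputs. The function name, the temperature, and the mixing weight alpha are illustrative choices, not taken from any of the papers listed here.

```python
# Minimal sketch of standard (logit-level) knowledge distillation in PyTorch.
# kd_loss, temperature, and alpha are illustrative names/values, not drawn from
# any specific paper in this list.
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            labels: torch.Tensor,
            temperature: float = 2.0,
            alpha: float = 0.5) -> torch.Tensor:
    """Weighted sum of hard-label cross-entropy and softened teacher/student KL."""
    # Hard-label supervision on the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label supervision: KL between temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kl

# Example usage with random tensors standing in for model outputs.
student = torch.randn(8, 30522)            # student logits over a vocabulary
teacher = torch.randn(8, 30522)            # teacher logits
labels = torch.randint(0, 30522, (8,))     # ground-truth token ids
loss = kd_loss(student, teacher, labels)
```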
Intermediate-layer knowledge distillation (KD) can improve on the standard KD technique (which targets only the outputs of the teacher and student models), especially over large pre …
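As a companion to the output-level loss above, the sketch below adds an intermediate-layer term: selected student hidden states are projected into the teacher's hidden size and matched to evenly spaced teacher layers with an MSE loss. The uniform layer mapping and the learned linear projection are common design choices assumed here for illustration, not the specific scheme of any paper in this list.

```python
# Minimal sketch of intermediate-layer distillation: hidden states of selected
# student layers are matched against teacher layers, in addition to the
# output-level KD loss. Layer mapping and projection are illustrative choices.
import torch
import torch.nn as nn

class HiddenStateMatcher(nn.Module):
    def __init__(self, student_dim: int, teacher_dim: int,
                 student_layers: int, teacher_layers: int):
        super().__init__()
        # Project student hidden states into the teacher's hidden size.
        self.proj = nn.Linear(student_dim, teacher_dim)
        # Map each student layer to an evenly spaced teacher layer.
        stride = teacher_layers // student_layers
        self.layer_map = [(i + 1) * stride - 1 for i in range(student_layers)]
        self.mse = nn.MSELoss()

    def forward(self, student_hidden, teacher_hidden):
        # student_hidden / teacher_hidden: per-layer lists of [batch, seq, dim] tensors.
        loss = 0.0
        for s_idx, t_idx in enumerate(self.layer_map):
            loss = loss + self.mse(self.proj(student_hidden[s_idx]),
                                   teacher_hidden[t_idx].detach())
        return loss / len(self.layer_map)

# Example: a 4-layer, 384-dim student distilled from a 12-layer, 768-dim teacher.
matcher = HiddenStateMatcher(384, 768, student_layers=4, teacher_layers=12)
s_hidden = [torch.randn(2, 16, 384) for _ in range(4)]
t_hidden = [torch.randn(2, 16, 768) for _ in range(12)]
hidden_loss = matcher(s_hidden, t_hidden)
```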
M. Zhang, N. U. Naresh, Y. He. Proceedings of the AAAI Conference on …, 2022.
Deep and large pre-trained language models (e.g., BERT, GPT-3) are state-of-the-art for various natural language processing tasks. However, the huge size of these models brings …
Knowledge distillation (KD) is widely used for compressing a teacher model into a smaller student model, reducing inference cost and memory footprint while preserving model …
Pretrained language models have led to significant performance gains in many NLP tasks. However, the intensive computing resources required to train such models remain an issue …