Meta-KD: A meta knowledge distillation framework for language model compression across domains

H Pan, C Wang, M Qiu, Y Zhang, Y Li… - arXiv preprint arXiv …, 2020 - arxiv.org
Pre-trained language models have been applied to various NLP tasks with considerable
performance gains. However, the large model sizes, together with the long inference time …

One teacher is enough? Pre-trained language model distillation from multiple teachers

C Wu, F Wu, Y Huang - arXiv preprint arXiv:2106.01023, 2021 - arxiv.org
Pre-trained language models (PLMs) achieve great success in NLP. However, their huge
model sizes hinder their applications in many practical systems. Knowledge distillation is a …
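A minimal sketch of the general multi-teacher idea, assuming the teachers' softened output distributions are simply averaged before the usual distillation objective; the stand-in linear models, temperature, and loss weighting below are illustrative, not the paper's setup:

```python
# Minimal multi-teacher knowledge distillation sketch (not the paper's exact method).
# Teachers and student are stand-in linear classifiers; in practice they would be
# pre-trained transformer encoders. All names and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, labels,
                          temperature=2.0, alpha=0.5):
    """Cross-entropy on gold labels + KL to the averaged teacher distribution."""
    # Average the teachers' softened probability distributions.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        teacher_probs,
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kd

# Toy usage: three frozen "teachers", one trainable "student".
torch.manual_seed(0)
num_features, num_classes, batch = 16, 4, 8
teachers = [nn.Linear(num_features, num_classes) for _ in range(3)]
student = nn.Linear(num_features, num_classes)
x = torch.randn(batch, num_features)
y = torch.randint(0, num_classes, (batch,))

with torch.no_grad():
    teacher_logits = [t(x) for t in teachers]
loss = multi_teacher_kd_loss(student(x), teacher_logits, y)
loss.backward()
print(float(loss))
```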

Knowledge distillation with Reptile meta-learning for pretrained language model compression

X Ma, J Wang, LC Yu, X Zhang - Proceedings of the 29th …, 2022 - aclanthology.org
The billions, and sometimes even trillions, of parameters involved in pre-trained language
models significantly hamper their deployment in resource-constrained devices and real-time …

Less is more: Task-aware layer-wise distillation for language model compression

C Liang, S Zuo, Q Zhang, P He… - … on Machine Learning, 2023 - proceedings.mlr.press
Layer-wise distillation is a powerful tool to compress large models (i.e., teacher models) into
small ones (i.e., student models). The student distills knowledge from the teacher by …
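A plain layer-wise distillation sketch for context, assuming a fixed student-to-teacher layer mapping, uniform layer weights, and equal hidden sizes; the paper's task-aware weighting is not reproduced here:

```python
# Plain layer-wise distillation sketch: match selected teacher hidden states to
# student hidden states with an MSE loss. The paper's task-aware weighting is
# replaced by uniform layer weights; hidden sizes are assumed equal.
import torch
import torch.nn.functional as F

def layerwise_distill_loss(student_hiddens, teacher_hiddens, layer_map):
    """student_hiddens/teacher_hiddens: lists of [batch, seq, hidden] tensors.
    layer_map: pairs (student_layer_idx, teacher_layer_idx) to align."""
    losses = [
        F.mse_loss(student_hiddens[s], teacher_hiddens[t])
        for s, t in layer_map
    ]
    return torch.stack(losses).mean()

# Toy usage: a 4-layer student distilling from a 12-layer teacher,
# aligning student layer i with teacher layer 3*(i+1) - 1.
batch, seq, hidden = 2, 8, 32
teacher_hiddens = [torch.randn(batch, seq, hidden) for _ in range(12)]
student_hiddens = [torch.randn(batch, seq, hidden, requires_grad=True) for _ in range(4)]
layer_map = [(i, 3 * (i + 1) - 1) for i in range(4)]  # [(0,2),(1,5),(2,8),(3,11)]
loss = layerwise_distill_loss(student_hiddens, teacher_hiddens, layer_map)
loss.backward()
print(float(loss))
```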

Contrastive distillation on intermediate representations for language model compression

S Sun, Z Gan, Y Cheng, Y Fang, S Wang… - arXiv preprint arXiv …, 2020 - arxiv.org
Existing language model compression methods mostly use a simple L2 loss to distill
knowledge in the intermediate representations of a large BERT model to a smaller one …
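To illustrate the contrast the abstract draws, a sketch comparing the simple L2 objective with an InfoNCE-style contrastive loss on pooled intermediate representations; the in-batch negative sampling and temperature below are assumptions, not the paper's exact objective:

```python
# Sketch contrasting a simple L2 objective with an InfoNCE-style contrastive
# loss on pooled intermediate representations. Illustration of the general
# idea only; hyperparameters and pooling are assumed.
import torch
import torch.nn.functional as F

def l2_distill_loss(student_repr, teacher_repr):
    return F.mse_loss(student_repr, teacher_repr)

def contrastive_distill_loss(student_repr, teacher_repr, temperature=0.1):
    """Treat (student_i, teacher_i) as a positive pair and every other
    teacher representation in the batch as a negative."""
    s = F.normalize(student_repr, dim=-1)
    t = F.normalize(teacher_repr, dim=-1)
    logits = s @ t.T / temperature      # [batch, batch] similarity matrix
    targets = torch.arange(s.size(0))   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage on pooled [CLS]-like vectors of matching width.
batch, hidden = 8, 64
student_repr = torch.randn(batch, hidden, requires_grad=True)
teacher_repr = torch.randn(batch, hidden)
print(float(l2_distill_loss(student_repr, teacher_repr)),
      float(contrastive_distill_loss(student_repr, teacher_repr)))
```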

MixKD: Towards efficient distillation of large-scale language models

KJ Liang, W Hao, D Shen, Y Zhou, W Chen… - arXiv preprint arXiv …, 2020 - arxiv.org
Large-scale language models have recently demonstrated impressive empirical
performance. Nevertheless, the improved results are attained at the price of bigger models …

Extreme language model compression with optimal subwords and shared projections

S Zhao, R Gupta, Y Song, D Zhou - 2019 - openreview.net
Pre-trained deep neural network language models such as ELMo, GPT, BERT and XLNet
have recently achieved state-of-the-art performance on a variety of language understanding …

Feature Alignment-Based Knowledge Distillation for Efficient Compression of Large Language Models

S Wang, C Wang, J Gao, Z Qi, H Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
This study proposes a knowledge distillation algorithm based on large language models
and feature alignment, aiming to effectively transfer the knowledge of large pre-trained …
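A minimal feature-alignment sketch, assuming the student's hidden size differs from the teacher's and a learned linear projection bridges the gap before an MSE alignment loss; the module and dimensions below are hypothetical, not taken from the paper:

```python
# Minimal feature-alignment sketch: a learned linear projection maps student
# features into the teacher's space before the alignment loss is computed.
# Dimensions here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAligner(nn.Module):
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_feat, teacher_feat):
        # Project student features, then penalize the distance to the teacher's.
        return F.mse_loss(self.proj(student_feat), teacher_feat)

# Toy usage: 312-dim student features aligned to 768-dim teacher features.
aligner = FeatureAligner(student_dim=312, teacher_dim=768)
student_feat = torch.randn(4, 128, 312)   # [batch, seq, student_dim]
teacher_feat = torch.randn(4, 128, 768)   # [batch, seq, teacher_dim]
loss = aligner(student_feat, teacher_feat)
loss.backward()
print(float(loss))
```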

Patient knowledge distillation for BERT model compression

S Sun, Y Cheng, Z Gan, J Liu - arXiv preprint arXiv:1908.09355, 2019 - arxiv.org
Pre-trained language models such as BERT have proven to be highly effective for natural
language processing (NLP) tasks. However, the high demand for computing resources in …

A short study on compressing decoder-based language models

T Li, YE Mesbahi, I Kobyzev, A Rashid… - arXiv preprint arXiv …, 2021 - arxiv.org
Pre-trained Language Models (PLMs) have been successful for a wide range of natural
language processing (NLP) tasks. State-of-the-art PLMs, however, are extremely …