Patient knowledge distillation for BERT model compression

S Sun, Y Cheng, Z Gan, J Liu - arXiv preprint arXiv:1908.09355, 2019 - arxiv.org
Pre-trained language models such as BERT have proven to be highly effective for natural
language processing (NLP) tasks. However, the high demand for computing resources in …

TinyBERT: Distilling BERT for natural language understanding

X Jiao, Y Yin, L Shang, X Jiang, X Chen, L Li… - arXiv preprint arXiv …, 2019 - arxiv.org
Language model pre-training, such as BERT, has significantly improved the performances of
many natural language processing tasks. However, pre-trained language models are …

BERT-EMD: Many-to-many layer mapping for BERT compression with earth mover's distance

J Li, X Liu, H Zhao, R Xu, M Yang, Y Jin - arXiv preprint arXiv:2010.06133, 2020 - arxiv.org
Pre-trained language models (e.g., BERT) have achieved significant success in various
natural language processing (NLP) tasks. However, high storage and computational costs …

One teacher is enough? Pre-trained language model distillation from multiple teachers

C Wu, F Wu, Y Huang - arXiv preprint arXiv:2106.01023, 2021 - arxiv.org
Pre-trained language models (PLMs) achieve great success in NLP. However, their huge
model sizes hinder their applications in many practical systems. Knowledge distillation is a …

LadaBERT: Lightweight adaptation of BERT through hybrid model compression

Y Mao, Y Wang, C Wu, C Zhang, Y Wang… - arXiv preprint arXiv …, 2020 - arxiv.org
BERT is a cutting-edge language representation model pre-trained by a large corpus, which
achieves superior performances on various natural language understanding tasks …

Meta-KD: A meta knowledge distillation framework for language model compression across domains

H Pan, C Wang, M Qiu, Y Zhang, Y Li… - arXiv preprint arXiv …, 2020 - arxiv.org
Pre-trained language models have been applied to various NLP tasks with considerable
performance gains. However, the large model sizes, together with the long inference time …

HomoDistil: Homotopic task-agnostic distillation of pre-trained transformers

C Liang, H Jiang, Z Li, X Tang, B Yin, T Zhao - arXiv preprint arXiv …, 2023 - arxiv.org
Knowledge distillation has been shown to be a powerful model compression approach to
facilitate the deployment of pre-trained language models in practice. This paper focuses on …

MoEBERT: From BERT to mixture-of-experts via importance-guided adaptation

S Zuo, Q Zhang, C Liang, P He, T Zhao… - arXiv preprint arXiv …, 2022 - arxiv.org
Pre-trained language models have demonstrated superior performance in various natural
language processing tasks. However, these models usually contain hundreds of millions of …

XtremeDistil: Multi-stage distillation for massive multilingual models

S Mukherjee, A Awadallah - arXiv preprint arXiv:2004.05686, 2020 - arxiv.org
Deep and large pre-trained language models are the state-of-the-art for various natural
language processing tasks. However, the huge size of these models could be a deterrent to …

Dynamic knowledge distillation for pre-trained language models

L Li, Y Lin, S Ren, P Li, J Zhou, X Sun - arXiv preprint arXiv:2109.11295, 2021 - arxiv.org
Knowledge distillation (KD) has been proven effective for compressing large-scale
pre-trained language models. However, existing methods conduct KD statically, e.g., the student …
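
The papers listed above differ in how they map teacher layers to student layers, combine multiple teachers, or schedule the distillation process, but all of them build on the standard soft-label distillation objective. The following is a minimal illustrative sketch of that shared objective in PyTorch, not the exact loss of any paper above; the function name `distillation_loss` and the `temperature` and `alpha` values are placeholder assumptions for illustration only.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature=2.0, alpha=0.5):
        """Generic soft-label knowledge distillation loss (illustrative sketch).

        Combines KL divergence between temperature-softened teacher and student
        distributions with ordinary cross-entropy on the gold labels.
        """
        # Soft targets: KL(teacher || student) at temperature T, scaled by T^2
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * (temperature ** 2)

        # Hard targets: standard cross-entropy against the ground-truth labels
        hard_loss = F.cross_entropy(student_logits, labels)

        return alpha * soft_loss + (1.0 - alpha) * hard_loss

    # Toy usage with random logits (batch of 4 examples, 3 classes)
    student = torch.randn(4, 3)
    teacher = torch.randn(4, 3)
    labels = torch.tensor([0, 2, 1, 0])
    loss = distillation_loss(student, teacher, labels)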