A short study on compressing decoder-based language models

T Li, YE Mesbahi, I Kobyzev, A Rashid… - arXiv preprint arXiv …, 2021 - arxiv.org
Pre-trained Language Models (PLMs) have been successful for a wide range of natural
language processing (NLP) tasks. State-of-the-art PLMs, however, are extremely …

KroneckerBERT: Learning Kronecker decomposition for pre-trained language models via knowledge distillation

MS Tahaei, E Charlaix, VP Nia, A Ghodsi… - arXiv preprint arXiv …, 2021 - arxiv.org
The development of over-parameterized pre-trained language models has made a
significant contribution toward the success of natural language processing. While over …
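
The idea common to the Kronecker-based entries in this list is to replace a large weight matrix with a Kronecker product of two much smaller factors. The sketch below is a minimal, hedged illustration of that single step using Van Loan's nearest-Kronecker-product construction; the shapes and the plain SVD fit are assumptions for illustration, not the papers' training procedure (which additionally uses knowledge distillation).

```python
# Minimal sketch: approximate a weight matrix by a single Kronecker product.
# Shapes and the SVD-based fit are illustrative assumptions.
import torch

def nearest_kronecker(W, a_shape, b_shape):
    """Find A (a_shape) and B (b_shape) minimizing ||W - kron(A, B)||_F
    via Van Loan's rearrangement and a rank-1 SVD."""
    m1, n1 = a_shape
    m2, n2 = b_shape
    assert W.shape == (m1 * m2, n1 * n2)
    # Rearrange W so that kron(A, B) becomes the rank-1 outer product vec(A) vec(B)^T.
    R = (W.reshape(m1, m2, n1, n2)
          .permute(0, 2, 1, 3)          # (m1, n1, m2, n2)
          .reshape(m1 * n1, m2 * n2))
    U, S, Vh = torch.linalg.svd(R, full_matrices=False)
    a = torch.sqrt(S[0]) * U[:, 0]
    b = torch.sqrt(S[0]) * Vh[0, :]
    return a.reshape(m1, n1), b.reshape(m2, n2)

W = torch.randn(768, 768)
A, B = nearest_kronecker(W, (64, 64), (12, 12))
approx = torch.kron(A, B)               # 64*64 + 12*12 parameters instead of 768*768
print(torch.norm(W - approx) / torch.norm(W))
```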

Robustness challenges in model distillation and pruning for natural language understanding

M Du, S Mukherjee, Y Cheng, M Shokouhi… - arXiv preprint arXiv …, 2021 - arxiv.org
Recent work has focused on compressing pre-trained language models (PLMs) like BERT
where the major focus has been to improve the in-distribution performance for downstream …

KroneckerBERT: Significant compression of pre-trained language models through Kronecker decomposition and knowledge distillation

M Tahaei, E Charlaix, V Nia, A Ghodsi… - Proceedings of the …, 2022 - aclanthology.org
The development of over-parameterized pre-trained language models has made a
significant contribution toward the success of natural language processing. While over …

Compressing pre-trained language models by matrix decomposition

MB Noach, Y Goldberg - Proceedings of the 1st Conference of the …, 2020 - aclanthology.org
Large pre-trained language models reach state-of-the-art results on many different NLP
tasks when fine-tuned individually; they also come with a significant memory and …
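
As a rough illustration of matrix-decomposition compression, the sketch below factorizes a linear layer with a truncated SVD into two smaller layers. The rank and layer sizes are assumed for the example, and the fine-tuning stage that follows factorization is omitted.

```python
# Hedged sketch: replace W (out x in) with two smaller factors via truncated SVD.
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace y = W x + b with y = U (V x) + b, where U: out x rank, V: rank x in."""
    W = layer.weight.data                       # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                # absorb singular values into U
    V_r = Vh[:rank, :]
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data.copy_(V_r)
    second.weight.data.copy_(U_r)
    if layer.bias is not None:
        second.bias.data.copy_(layer.bias.data)
    return nn.Sequential(first, second)

dense = nn.Linear(768, 3072)
compressed = factorize_linear(dense, rank=128)  # 768*128 + 128*3072 params vs 768*3072
x = torch.randn(4, 768)
print((dense(x) - compressed(x)).abs().max())   # approximation error on random input
```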

Feature Alignment-Based Knowledge Distillation for Efficient Compression of Large Language Models

S Wang, C Wang, J Gao, Z Qi, H Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
This study proposes a knowledge distillation algorithm based on large language models
and feature alignment, aiming to effectively transfer the knowledge of large pre-trained …
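
A minimal sketch of the kind of objective such feature-alignment distillation uses: a temperature-scaled KL term on logits plus an MSE term that aligns projected student hidden states with teacher hidden states. The projection layer, loss weights, and temperature here are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a distillation loss combining soft-label KL with feature alignment.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      proj: nn.Linear, temperature=2.0, alpha=0.5):
    # Soft-label term: KL between temperature-scaled teacher and student distributions.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Feature-alignment term: project the narrower student hidden states into the
    # teacher's width and match them with MSE.
    align = F.mse_loss(proj(student_hidden), teacher_hidden)
    return alpha * kd + (1 - alpha) * align

proj = nn.Linear(384, 768)   # assumed student width 384, teacher width 768
s_logits, t_logits = torch.randn(8, 30522), torch.randn(8, 30522)
s_hidden, t_hidden = torch.randn(8, 384), torch.randn(8, 768)
print(distillation_loss(s_logits, t_logits, s_hidden, t_hidden, proj))
```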

Meta-KD: A meta knowledge distillation framework for language model compression across domains

H Pan, C Wang, M Qiu, Y Zhang, Y Li… - arXiv preprint arXiv …, 2020 - arxiv.org
Pre-trained language models have been applied to various NLP tasks with considerable
performance gains. However, the large model sizes, together with the long inference time …

Kronecker decomposition for GPT compression

A Edalati, M Tahaei, A Rashid, VP Nia, JJ Clark… - arXiv preprint arXiv …, 2021 - arxiv.org
GPT is an auto-regressive Transformer-based pre-trained language model which has
attracted a lot of attention in the natural language processing (NLP) domain due to its state …

Revisiting intermediate layer distillation for compressing language models: An overfitting perspective

J Ko, S Park, M Jeong, S Hong, E Ahn… - arXiv preprint arXiv …, 2023 - arxiv.org
Knowledge distillation (KD) is a highly promising method for mitigating the computational
problems of pre-trained language models (PLMs). Among various KD approaches …

MoEBERT: From BERT to mixture-of-experts via importance-guided adaptation

S Zuo, Q Zhang, C Liang, P He, T Zhao… - arXiv preprint arXiv …, 2022 - arxiv.org
Pre-trained language models have demonstrated superior performance in various natural
language processing tasks. However, these models usually contain hundreds of millions of …
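
For intuition, the sketch below shows a feed-forward block replaced by a small mixture of experts with per-token top-1 routing, so only one expert's weights are exercised per token. The expert count and sizes, the routing rule, and the absence of importance-guided initialization are simplifying assumptions rather than MoEBERT's actual adaptation procedure.

```python
# Hedged sketch: a feed-forward block as a small mixture of experts with top-1 routing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=768, d_expert=768, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(), nn.Linear(d_expert, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        expert_idx = gate.argmax(dim=-1)         # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask]) * gate[mask, i].unsqueeze(-1)
        return out

layer = MoEFeedForward()
tokens = torch.randn(16, 768)
print(layer(tokens).shape)                       # each token is processed by one expert
```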