Less is more: Task-aware layer-wise distillation for language model compression

C Liang, S Zuo, Q Zhang, P He… - … on Machine Learning, 2023 - proceedings.mlr.press
Layer-wise distillation is a powerful tool to compress large models (i.e., teacher models) into
small ones (ie, student models). The student distills knowledge from the teacher by …
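
For context, a minimal PyTorch sketch of generic layer-wise distillation, in which each student layer is matched to a teacher layer and trained to mimic its hidden states. This illustrates the general mechanism only, not the paper's task-aware variant; the names (`layer_map`, `proj`, the hidden-state lists) are illustrative assumptions.

```python
# Minimal sketch of a generic layer-wise distillation loss (illustrative only,
# not the task-aware variant proposed in the paper). Assumes the student has
# fewer layers than the teacher and each student layer is mapped to one
# teacher layer; all names are hypothetical.
import torch
import torch.nn as nn

def layerwise_distill_loss(student_hiddens, teacher_hiddens, layer_map, proj):
    """MSE between projected student hidden states and matched teacher layers.

    student_hiddens: list of [batch, seq, d_s] tensors, one per student layer
    teacher_hiddens: list of [batch, seq, d_t] tensors, one per teacher layer
    layer_map:       layer_map[i] = teacher layer index matched to student layer i
    proj:            nn.Linear(d_s, d_t) mapping student width to teacher width
    """
    loss = 0.0
    for i, t_idx in enumerate(layer_map):
        loss = loss + nn.functional.mse_loss(proj(student_hiddens[i]),
                                             teacher_hiddens[t_idx])
    return loss / len(layer_map)
```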

Contrastive distillation on intermediate representations for language model compression

S Sun, Z Gan, Y Cheng, Y Fang, S Wang… - arXiv preprint arXiv …, 2020 - arxiv.org
Existing language model compression methods mostly use a simple L2 loss to distill
knowledge in the intermediate representations of a large BERT model to a smaller one …
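
As a point of reference, a hedged sketch of an InfoNCE-style contrastive objective between pooled student and teacher intermediate representations: each student representation is pulled toward its own teacher representation and pushed away from other examples in the batch. The temperature and tensor shapes are assumptions, not the paper's exact formulation.

```python
# Hedged sketch: an InfoNCE-style contrastive loss between student and teacher
# intermediate representations, in the spirit of (but not identical to) the
# paper's objective.
import torch
import torch.nn.functional as F

def contrastive_distill_loss(student_repr, teacher_repr, temperature=0.1):
    """student_repr, teacher_repr: [batch, dim] pooled intermediate features."""
    s = F.normalize(student_repr, dim=-1)
    t = F.normalize(teacher_repr, dim=-1)
    logits = s @ t.t() / temperature            # [batch, batch] similarity matrix
    labels = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, labels)      # diagonal entries are the positives
```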

Meta-KD: A meta knowledge distillation framework for language model compression across domains

H Pan, C Wang, M Qiu, Y Zhang, Y Li… - arXiv preprint arXiv …, 2020 - arxiv.org
Pre-trained language models have been applied to various NLP tasks with considerable
performance gains. However, the large model sizes, together with the long inference time …

ASVD: Activation-aware singular value decomposition for compressing large language models

Z Yuan, Y Shang, Y Song, Q Wu, Y Yan… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper explores a new post-hoc training-free compression paradigm for compressing
Large Language Models (LLMs) to facilitate their wider adoption in various computing …
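
To illustrate the general idea of activation-aware factorization, a simplified sketch that scales the weight columns by per-channel activation statistics before a truncated SVD and folds the scaling back afterwards. The scaling rule here is an assumption for illustration, not necessarily ASVD's exact formulation.

```python
# Simplified sketch of activation-aware low-rank compression: scale the weight
# matrix by per-input-channel activation magnitudes, take a truncated SVD, then
# undo the scaling. Training-free and post-hoc, like the paradigm described,
# but not claimed to match the paper's exact method.
import torch

def activation_aware_svd(W, act_scale, rank):
    """W: [out, in] weight; act_scale: [in] mean |activation| per input channel."""
    s = act_scale.clamp(min=1e-6)                        # per-channel scaling
    U, sigma, Vh = torch.linalg.svd(W * s[None, :], full_matrices=False)
    U_r = U[:, :rank] * sigma[:rank]                     # absorb singular values
    V_r = Vh[:rank, :] / s[None, :]                      # undo the scaling
    return U_r, V_r                                      # W ≈ U_r @ V_r
```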

Language model compression with weighted low-rank factorization

YC Hsu, T Hua, S Chang, Q Lou, Y Shen… - arXiv preprint arXiv …, 2022 - arxiv.org
Factorizing a large matrix into small matrices is a popular strategy for model compression.
Singular value decomposition (SVD) plays a vital role in this compression strategy …
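
A minimal sketch of importance-weighted truncated SVD for one linear layer, assuming a per-row importance vector (e.g., derived from Fisher information); the paper's exact weighting scheme is not reproduced here.

```python
# Hedged sketch of weighted low-rank factorization: rows of W are reweighted by
# an importance score before the SVD, so the truncation preserves important
# rows more faithfully. The weighting vector is an assumed input.
import torch

def weighted_svd_factorize(W, row_importance, rank):
    """Factor W ([out, in]) into A @ B with A: [out, rank], B: [rank, in]."""
    w = row_importance.clamp(min=1e-6).sqrt()             # [out] row weights
    U, sigma, Vh = torch.linalg.svd(W * w[:, None], full_matrices=False)
    A = (U[:, :rank] * sigma[:rank]) / w[:, None]         # undo the row weighting
    B = Vh[:rank, :]
    return A, B                                           # W ≈ A @ B
```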

From dense to sparse: Contrastive pruning for better pre-trained language model compression

R Xu, F Luo, C Wang, B Chang, J Huang… - Proceedings of the …, 2022 - ojs.aaai.org
Pre-trained Language Models (PLMs) have achieved great success in various
Natural Language Processing (NLP) tasks under the pre-training and fine-tuning paradigm …

One teacher is enough? Pre-trained language model distillation from multiple teachers

C Wu, F Wu, Y Huang - arXiv preprint arXiv:2106.01023, 2021 - arxiv.org
Pre-trained language models (PLMs) achieve great success in NLP. However, their huge
model sizes hinder their applications in many practical systems. Knowledge distillation is a …
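
For reference, a hedged sketch of plain multi-teacher knowledge distillation, in which the student matches a uniform average of several teachers' softened output distributions; the averaging rule and temperature are illustrative choices, not the paper's method.

```python
# Hedged sketch of multi-teacher KD: average the teachers' softened predictions
# and train the student against that average with a KL divergence. A naive
# uniform average; weighted or selective schemes are not shown.
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, temperature=2.0):
    """student_logits: [batch, classes]; teacher_logits_list: list of same-shape tensors."""
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)                                          # uniform teacher average
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_student, teacher_probs, reduction="batchmean") * temperature ** 2
```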

LoSparse: Structured compression of large language models based on low-rank and sparse approximation

Y Li, Y Yu, Q Zhang, C Liang, P He… - International …, 2023 - proceedings.mlr.press
Transformer models have achieved remarkable results in various natural language tasks,
but they are often prohibitively large, requiring massive memories and computational …
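
To make the decomposition concrete, an illustrative sketch of a low-rank-plus-sparse approximation W ≈ UV + S, where S keeps only the largest-magnitude entries of the residual; the magnitude threshold is a stand-in for the paper's training procedure, not a reproduction of it.

```python
# Illustrative low-rank-plus-sparse decomposition of a weight matrix: truncated
# SVD for the low-rank part, then a sparse residual retaining the largest
# leftover entries. Shown only to convey the structure behind this family of
# methods.
import torch

def low_rank_plus_sparse(W, rank, sparsity=0.95):
    """Return U ([out, rank]), V ([rank, in]) and a sparse residual S (shape of W)."""
    U, sigma, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * sigma[:rank]
    V_r = Vh[:rank, :]
    residual = W - U_r @ V_r
    k = int(residual.numel() * (1 - sparsity))             # residual entries to keep
    threshold = residual.abs().flatten().kthvalue(residual.numel() - k).values
    S = torch.where(residual.abs() > threshold, residual, torch.zeros_like(residual))
    return U_r, V_r, S                                      # W ≈ U_r @ V_r + S
```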

Compression of generative pre-trained language models via quantization

C Tao, L Hou, W Zhang, L Shang, X Jiang, Q Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
The increasing size of generative Pre-trained Language Models (PLMs) has greatly
increased the demand for model compression. Despite various methods to compress BERT …
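
As background, a minimal sketch of symmetric per-tensor fake quantization of a weight matrix; it only illustrates the basic operation under discussion, not the paper's quantization scheme for generative PLMs.

```python
# Minimal symmetric per-tensor fake quantization: map weights to signed
# integers with a single scale, then dequantize to measure the approximation.
import torch

def quantize_dequantize(W, num_bits=8):
    """Quantize a weight tensor to signed integers, then dequantize."""
    qmax = 2 ** (num_bits - 1) - 1                         # e.g. 127 for int8
    scale = W.abs().max().clamp(min=1e-8) / qmax
    W_q = torch.clamp(torch.round(W / scale), -qmax - 1, qmax)
    return W_q * scale                                     # dequantized approximation of W
```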

Compressing large language models by joint sparsification and quantization

J Guo, J Wu, Z Wang, J Liu, G Yang, Y Ding… - … on Machine Learning, 2024 - openreview.net
In this paper, we introduce a novel model compression technique named Joint Sparsification
and Quantization (JSQ), explicitly tailored for large language models (LLMs). Traditional …
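
For a mechanical picture of what joint sparsification and quantization combines, a hedged sketch that applies magnitude pruning followed by simple symmetric quantization; JSQ co-optimizes the two, which this naive sequential version does not attempt.

```python
# Naive sequential combination of magnitude pruning and symmetric fake
# quantization, shown only to illustrate the two operations JSQ combines.
import torch

def sparsify_then_quantize(W, sparsity=0.5, num_bits=8):
    """Zero out the smallest-magnitude weights, then fake-quantize the survivors."""
    k = int(W.numel() * sparsity)                          # number of weights to prune
    if k > 0:
        threshold = W.abs().flatten().kthvalue(k).values
        W = torch.where(W.abs() > threshold, W, torch.zeros_like(W))
    qmax = 2 ** (num_bits - 1) - 1
    scale = W.abs().max().clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(W / scale), -qmax - 1, qmax) * scale
```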