Contrastive distillation on intermediate representations for language model compression

S Sun, Z Gan, Y Cheng, Y Fang, S Wang… - arXiv preprint arXiv …, 2020 - arxiv.org
Existing language model compression methods mostly use a simple L2 loss to distill
knowledge in the intermediate representations of a large BERT model to a smaller one …
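The snippet contrasts a plain L2 (MSE) loss on intermediate hidden states with the contrastive objective named in the title. As a minimal sketch (not the paper's exact formulation; function names and the use of pooled, equal-width hidden states are assumptions, and a linear projection would be needed if student and teacher widths differ):

```python
import torch
import torch.nn.functional as F

def mse_distill(student_h, teacher_h):
    # Plain L2 baseline: directly match student and teacher hidden states.
    return F.mse_loss(student_h, teacher_h)

def contrastive_distill(student_h, teacher_h, temperature=0.1):
    # InfoNCE-style objective: each student vector should be most similar to
    # its own teacher vector among all teacher vectors in the batch.
    s = F.normalize(student_h, dim=-1)            # (batch, dim)
    t = F.normalize(teacher_h, dim=-1)            # (batch, dim)
    logits = s @ t.t() / temperature              # (batch, batch) similarities
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)
```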

Less is more: Task-aware layer-wise distillation for language model compression

C Liang, S Zuo, Q Zhang, P He… - … on Machine Learning, 2023 - proceedings.mlr.press
Layer-wise distillation is a powerful tool to compress large models (i.e., teacher models) into
small ones (i.e., student models). The student distills knowledge from the teacher by …
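For context, a generic layer-wise distillation loss with a uniform teacher-to-student layer mapping looks roughly as follows; the task-aware weighting this paper adds is not reproduced here, and the argument names and the `proj` linear layer are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def layerwise_distill_loss(student_hiddens, teacher_hiddens, proj):
    # student_hiddens: list of (batch, seq, d_s) tensors, one per student layer
    # teacher_hiddens: list of (batch, seq, d_t) tensors, one per teacher layer
    # proj: nn.Linear(d_s, d_t) mapping student width to teacher width
    k = len(teacher_hiddens) // len(student_hiddens)   # uniform layer mapping
    loss = 0.0
    for i, s_h in enumerate(student_hiddens):
        t_h = teacher_hiddens[(i + 1) * k - 1]          # student layer i <- teacher layer (i+1)*k
        loss = loss + F.mse_loss(proj(s_h), t_h)
    return loss / len(student_hiddens)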

Meta-KD: A meta knowledge distillation framework for language model compression across domains

H Pan, C Wang, M Qiu, Y Zhang, Y Li… - arXiv preprint arXiv …, 2020 - arxiv.org
Pre-trained language models have been applied to various NLP tasks with considerable
performance gains. However, the large model sizes, together with the long inference time …

Compression of generative pre-trained language models via quantization

C Tao, L Hou, W Zhang, L Shang, X Jiang, Q Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
The increasing size of generative Pre-trained Language Models (PLMs) has greatly
increased the demand for model compression. Despite various methods to compress BERT …
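As a reference point for the quantization family of methods (this is generic symmetric uniform quantization, not the scheme proposed in the paper; names are illustrative):

```python
import torch

def quantize_weight(w, n_bits=8):
    # Symmetric uniform quantization: map float weights to signed integers
    # and keep the per-tensor scale for dequantization at inference time.
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

def dequantize_weight(q, scale):
    return q.float() * scale
```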

Language model compression with weighted low-rank factorization

YC Hsu, T Hua, S Chang, Q Lou, Y Shen… - arXiv preprint arXiv …, 2022 - arxiv.org
Factorizing a large matrix into small matrices is a popular strategy for model compression.
Singular value decomposition (SVD) plays a vital role in this compression strategy …
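The baseline this paper builds on is plain truncated SVD of a weight matrix; a minimal sketch is below (the paper's weighting of the factorization is not reproduced, and the function name is an assumption):

```python
import torch

def svd_compress(w, rank):
    # Truncated SVD: approximate a (d_out, d_in) weight matrix by the product
    # of two low-rank factors, cutting parameters from d_out*d_in to
    # rank*(d_out + d_in).
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]      # (d_out, rank)
    b = vh[:rank, :]                # (rank, d_in)
    return a, b                     # w ≈ a @ b
```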

KroneckerBERT: Learning Kronecker decomposition for pre-trained language models via knowledge distillation

MS Tahaei, E Charlaix, VP Nia, A Ghodsi… - arXiv preprint arXiv …, 2021 - arxiv.org
The development of over-parameterized pre-trained language models has made a
significant contribution toward the success of natural language processing. While over …
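To illustrate why Kronecker factorization compresses a weight matrix (an illustrative parameter count only, not the paper's decomposition or training procedure; the shapes are assumptions):

```python
import torch

# A full (768, 3072) matrix has ~2.36M parameters; representing it as
# kron(A, B) with A of shape (24, 48) and B of shape (32, 64) needs ~3.2K.
A = torch.randn(24, 48)
B = torch.randn(32, 64)
W = torch.kron(A, B)                      # shape (24*32, 48*64) = (768, 3072)
print(W.shape, A.numel() + B.numel())     # torch.Size([768, 3072]) 3200
```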

Extreme language model compression with optimal subwords and shared projections

S Zhao, R Gupta, Y Song, D Zhou - 2019 - openreview.net
Pre-trained deep neural network language models such as ELMo, GPT, BERT and XLNet
have recently achieved state-of-the-art performance on a variety of language understanding …

ASVD: Activation-aware singular value decomposition for compressing large language models

Z Yuan, Y Shang, Y Song, Q Wu, Y Yan… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper explores a new post-hoc training-free compression paradigm for compressing
Large Language Models (LLMs) to facilitate their wider adoption in various computing …
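A rough sketch of the activation-aware idea, assuming the weight is rescaled by per-input-channel activation statistics before the SVD and the scaling is folded back into the factors (variable names and the exact choice of statistic are assumptions, not the paper's specification):

```python
import torch

def activation_aware_svd(w, act_scale, rank):
    # w: (d_out, d_in) weight; act_scale: (d_in,) positive per-channel statistic
    # (e.g. mean absolute activation) collected on calibration data.
    # Factorize W*S with truncated SVD, then divide S back out of the right
    # factor so (a @ b) x still approximates W x, while the decomposition
    # prioritizes input channels with large activations.
    u, sigma, vh = torch.linalg.svd(w * act_scale, full_matrices=False)
    a = u[:, :rank] * sigma[:rank]          # (d_out, rank)
    b = vh[:rank, :] / act_scale            # (rank, d_in)
    return a, b
```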

Feature Alignment-Based Knowledge Distillation for Efficient Compression of Large Language Models

S Wang, C Wang, J Gao, Z Qi, H Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
This study proposes a knowledge distillation algorithm based on large language models
and feature alignment, aiming to effectively transfer the knowledge of large pre-trained …

BinaryBERT: Pushing the limit of BERT quantization

H Bai, W Zhang, L Hou, L Shang, J Jin, X Jiang… - arXiv preprint arXiv …, 2020 - arxiv.org
The rapid development of large pre-trained language models has greatly increased the
demand for model compression techniques, among which quantization is a popular solution …
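For orientation, a generic 1-bit weight quantizer looks like the sketch below; this is the common sign-plus-scale formulation, not BinaryBERT's specific training recipe, and the per-row scaling choice is an assumption:

```python
import torch

def binarize_weight(w):
    # Replace each row of W with its sign, scaled by the row's mean absolute
    # value so the average magnitude is preserved (a straight-through
    # estimator would handle the gradient during training).
    alpha = w.abs().mean(dim=1, keepdim=True)   # (d_out, 1) per-row scale
    return alpha * torch.sign(w)
```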