Layer-wise distillation is a powerful tool to compress large models (i.e., teacher models) into small ones (i.e., student models). The student distills knowledge from the teacher by …
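As a rough illustration of the idea (not any specific paper's method), a layer-wise distillation loss can be written as an MSE between mapped student and teacher hidden states; the layer mapping, hidden sizes, and weighting below are assumed for the sketch, using PyTorch.

```python
import torch
import torch.nn.functional as F

def layerwise_distill_loss(student_hidden, teacher_hidden, layer_map, alpha=1.0):
    """MSE between selected student layers and their mapped teacher layers.

    student_hidden / teacher_hidden: lists of [batch, seq, dim] tensors
    (assumed to share the same hidden width here).
    layer_map: dict {student_layer_idx: teacher_layer_idx} (an assumed mapping).
    """
    loss = 0.0
    for s_idx, t_idx in layer_map.items():
        loss = loss + F.mse_loss(student_hidden[s_idx], teacher_hidden[t_idx].detach())
    return alpha * loss / max(len(layer_map), 1)

# Example mapping: a 6-layer student mimicking every other layer of a 12-layer teacher.
layer_map = {i: 2 * i + 1 for i in range(6)}
```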
Pre-trained language models have been applied to various NLP tasks with considerable performance gains. However, the large model sizes, together with the long inference time …
The growing size of generative Pre-trained Language Models (PLMs) has greatly increased the demand for model compression. Despite various methods to compress BERT …
Factorizing a large matrix into smaller matrices is a popular strategy for model compression. Singular value decomposition (SVD) plays a vital role in this compression strategy …
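As a concrete sketch of this strategy (generic truncated SVD, not tied to any particular method above), a weight matrix can be replaced by two low-rank factors; the matrix shape and rank below are illustrative assumptions.

```python
import torch

def svd_compress(weight, rank):
    """Factor a weight matrix W (out x in) into low-rank factors A @ B.

    Keeping the top `rank` singular values approximates W while storing
    rank * (out + in) parameters instead of out * in.
    """
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (out, rank), columns scaled by singular values
    B = Vh[:rank, :]             # (rank, in)
    return A, B

# Illustrative sizes: a 768 x 3072 feed-forward weight compressed to rank 64.
W = torch.randn(768, 3072)
A, B = svd_compress(W, rank=64)
relative_error = torch.norm(W - A @ B) / torch.norm(W)
```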
The development of over-parameterized pre-trained language models has contributed significantly to the success of natural language processing. While over …
Pre-trained deep neural network language models such as ELMo, GPT, BERT and XLNet have recently achieved state-of-the-art performance on a variety of language understanding …
This paper explores a new post-hoc training-free compression paradigm for compressing Large Language Models (LLMs) to facilitate their wider adoption in various computing …
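The paradigm itself is not detailed in the snippet; as a generic stand-in for a post-hoc, training-free compression step (explicitly not the paper's method), unstructured magnitude pruning of a weight tensor looks roughly as follows.

```python
import torch

def magnitude_prune(weight, sparsity=0.5):
    """Zero out the smallest-magnitude weights (per-tensor, unstructured).

    A generic training-free compression step for illustration only; the
    referenced paper's actual paradigm is not reproduced here.
    """
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values
    return torch.where(weight.abs() > threshold, weight, torch.zeros_like(weight))
```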
This study proposes a knowledge distillation algorithm based on large language models and feature alignment, aiming to effectively transfer the knowledge of large pre-trained …
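A minimal sketch of distillation with a feature-alignment term, assuming soft-label KL on logits plus an MSE on projected hidden features; the projection layer, temperature, and loss weight are hypothetical choices, not the study's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlignKD(nn.Module):
    """Soft-label KL loss plus an MSE feature-alignment term.

    The linear projection bridges differing hidden sizes; temperature and
    loss weights are assumed values for illustration.
    """
    def __init__(self, student_dim, teacher_dim, temperature=2.0, beta=0.5):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)
        self.T = temperature
        self.beta = beta

    def forward(self, s_logits, t_logits, s_feat, t_feat):
        kd = F.kl_div(
            F.log_softmax(s_logits / self.T, dim=-1),
            F.softmax(t_logits / self.T, dim=-1),
            reduction="batchmean",
        ) * (self.T ** 2)
        align = F.mse_loss(self.proj(s_feat), t_feat.detach())
        return kd + self.beta * align
```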
The rapid development of large pre-trained language models has greatly increased the demand for model compression techniques, among which quantization is a popular solution …
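For reference, a minimal example of post-training weight quantization (per-tensor symmetric int8, a generic scheme rather than any specific method from the papers above); the matrix size is illustrative.

```python
import torch

def quantize_int8(weight):
    """Symmetric per-tensor int8 quantization of a weight matrix.

    Stores int8 values plus one float scale; dequantize with q.float() * scale.
    """
    scale = weight.abs().max() / 127.0
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale

W = torch.randn(1024, 1024)
q, scale = quantize_int8(W)
W_hat = q.float() * scale
mean_abs_error = (W - W_hat).abs().mean()
```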