Less is more: Task-aware layer-wise distillation for language model compression

C Liang, S Zuo, Q Zhang, P He… - … on Machine Learning, 2023 - proceedings.mlr.press
Layer-wise distillation is a powerful tool to compress large models (i.e., teacher models) into
small ones (ie, student models). The student distills knowledge from the teacher by …
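
For context, a minimal PyTorch sketch of generic layer-wise distillation, in which each student layer is matched to a teacher layer and trained to mimic its hidden states. This illustrates the general mechanism only, not the paper's task-aware variant; the names (`layer_map`, `proj`, the hidden-state lists) are illustrative assumptions.

```python
# Minimal sketch of a generic layer-wise distillation loss (illustrative only,
# not the task-aware variant proposed in the paper). Assumes the student has
# fewer layers than the teacher and each student layer is mapped to one
# teacher layer; all names are hypothetical.
import torch
import torch.nn as nn

def layerwise_distill_loss(student_hiddens, teacher_hiddens, layer_map, proj):
    """MSE between projected student hidden states and matched teacher layers.

    student_hiddens: list of [batch, seq, d_s] tensors, one per student layer
    teacher_hiddens: list of [batch, seq, d_t] tensors, one per teacher layer
    layer_map:       layer_map[i] = teacher layer index matched to student layer i
    proj:            nn.Linear(d_s, d_t) mapping student width to teacher width
    """
    loss = 0.0
    for i, t_idx in enumerate(layer_map):
        loss = loss + nn.functional.mse_loss(proj(student_hiddens[i]),
                                             teacher_hiddens[t_idx])
    return loss / len(layer_map)
```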

Contrastive distillation on intermediate representations for language model compression

S Sun, Z Gan, Y Cheng, Y Fang, S Wang… - arXiv preprint arXiv …, 2020 - arxiv.org
Existing language model compression methods mostly use a simple L2 loss to distill
knowledge in the intermediate representations of a large BERT model to a smaller one …
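
As a point of reference, a hedged sketch of an InfoNCE-style contrastive objective between pooled student and teacher intermediate representations: each student representation is pulled toward its own teacher representation and pushed away from other examples in the batch. The temperature and tensor shapes are assumptions, not the paper's exact formulation.

```python
# Hedged sketch: an InfoNCE-style contrastive loss between student and teacher
# intermediate representations, in the spirit of (but not identical to) the
# paper's objective.
import torch
import torch.nn.functional as F

def contrastive_distill_loss(student_repr, teacher_repr, temperature=0.1):
    """student_repr, teacher_repr: [batch, dim] pooled intermediate features."""
    s = F.normalize(student_repr, dim=-1)
    t = F.normalize(teacher_repr, dim=-1)
    logits = s @ t.t() / temperature            # [batch, batch] similarity matrix
    labels = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, labels)      # diagonal entries are the positives
```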

Meta-KD: A meta knowledge distillation framework for language model compression across domains

H Pan, C Wang, M Qiu, Y Zhang, Y Li… - arXiv preprint arXiv …, 2020 - arxiv.org
Pre-trained language models have been applied to various NLP tasks with considerable
performance gains. However, the large model sizes, together with the long inference time …

ASVD: Activation-aware singular value decomposition for compressing large language models

Z Yuan, Y Shang, Y Song, Q Wu, Y Yan… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper explores a new post-hoc training-free compression paradigm for compressing
Large Language Models (LLMs) to facilitate their wider adoption in various computing …
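
To illustrate the general idea of activation-aware factorization, a simplified sketch that scales the weight columns by per-channel activation statistics before a truncated SVD and folds the scaling back afterwards. The scaling rule here is an assumption for illustration, not necessarily ASVD's exact formulation.

```python
# Simplified sketch of activation-aware low-rank compression: scale the weight
# matrix by per-input-channel activation magnitudes, take a truncated SVD, then
# undo the scaling. Training-free and post-hoc, like the paradigm described,
# but not claimed to match the paper's exact method.
import torch

def activation_aware_svd(W, act_scale, rank):
    """W: [out, in] weight; act_scale: [in] mean |activation| per input channel."""
    s = act_scale.clamp(min=1e-6)                        # per-channel scaling
    U, sigma, Vh = torch.linalg.svd(W * s[None, :], full_matrices=False)
    U_r = U[:, :rank] * sigma[:rank]                     # absorb singular values
    V_r = Vh[:rank, :] / s[None, :]                      # undo the scaling
    return U_r, V_r                                      # W ≈ U_r @ V_r
```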

Language model compression with weighted low-rank factorization

YC Hsu, T Hua, S Chang, Q Lou, Y Shen… - arXiv preprint arXiv …, 2022 - arxiv.org
Factorizing a large matrix into small matrices is a popular strategy for model compression.
Singular value decomposition (SVD) plays a vital role in this compression strategy …
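
A minimal sketch of importance-weighted truncated SVD for one linear layer, assuming a per-row importance vector (e.g., derived from Fisher information); the paper's exact weighting scheme is not reproduced here.

```python
# Hedged sketch of weighted low-rank factorization: rows of W are reweighted by
# an importance score before the SVD, so the truncation preserves important
# rows more faithfully. The weighting vector is an assumed input.
import torch

def weighted_svd_factorize(W, row_importance, rank):
    """Factor W ([out, in]) into A @ B with A: [out, rank], B: [rank, in]."""
    w = row_importance.clamp(min=1e-6).sqrt()             # [out] row weights
    U, sigma, Vh = torch.linalg.svd(W * w[:, None], full_matrices=False)
    A = (U[:, :rank] * sigma[:rank]) / w[:, None]         # undo the row weighting
    B = Vh[:rank, :]
    return A, B                                           # W ≈ A @ B
```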

From dense to sparse: Contrastive pruning for better pre-trained language model compression

R Xu, F Luo, C Wang, B Chang, J Huang… - Proceedings of the …, 2022 - ojs.aaai.org
Pre-trained Language Models (PLMs) have achieved great success in various
Natural Language Processing (NLP) tasks under the pre-training and fine-tuning paradigm …

One teacher is enough? Pre-trained language model distillation from multiple teachers

C Wu, F Wu, Y Huang - arXiv preprint arXiv:2106.01023, 2021 - arxiv.org
Pre-trained language models (PLMs) achieve great success in NLP. However, their huge
model sizes hinder their applications in many practical systems. Knowledge distillation is a …
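
For reference, a hedged sketch of plain multi-teacher knowledge distillation, in which the student matches a uniform average of several teachers' softened output distributions; the averaging rule and temperature are illustrative choices, not the paper's method.

```python
# Hedged sketch of multi-teacher KD: average the teachers' softened predictions
# and train the student against that average with a KL divergence. A naive
# uniform average; weighted or selective schemes are not shown.
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, temperature=2.0):
    """student_logits: [batch, classes]; teacher_logits_list: list of same-shape tensors."""
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)                                          # uniform teacher average
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_student, teacher_probs, reduction="batchmean") * temperature ** 2
```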

LoSparse: Structured compression of large language models based on low-rank and sparse approximation

Y Li, Y Yu, Q Zhang, C Liang, P He… - International …, 2023 - proceedings.mlr.press
Transformer models have achieved remarkable results in various natural language tasks,
but they are often prohibitively large, requiring massive memories and computational …
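
To make the decomposition concrete, an illustrative sketch of a low-rank-plus-sparse approximation W ≈ UV + S, where S keeps only the largest-magnitude entries of the residual; the magnitude threshold is a stand-in for the paper's training procedure, not a reproduction of it.

```python
# Illustrative low-rank-plus-sparse decomposition of a weight matrix: truncated
# SVD for the low-rank part, then a sparse residual retaining the largest
# leftover entries. Shown only to convey the structure behind this family of
# methods.
import torch

def low_rank_plus_sparse(W, rank, sparsity=0.95):
    """Return U ([out, rank]), V ([rank, in]) and a sparse residual S (shape of W)."""
    U, sigma, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * sigma[:rank]
    V_r = Vh[:rank, :]
    residual = W - U_r @ V_r
    k = int(residual.numel() * (1 - sparsity))             # residual entries to keep
    threshold = residual.abs().flatten().kthvalue(residual.numel() - k).values
    S = torch.where(residual.abs() > threshold, residual, torch.zeros_like(residual))
    return U_r, V_r, S                                      # W ≈ U_r @ V_r + S
```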

Compression of generative pre-trained language models via quantization

C Tao, L Hou, W Zhang, L Shang, X Jiang, Q Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
The increasing size of generative Pre-trained Language Models (PLMs) has greatly
increased the demand for model compression. Despite various methods to compress BERT …
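
As background, a minimal sketch of symmetric per-tensor fake quantization of a weight matrix; it only illustrates the basic operation under discussion, not the paper's quantization scheme for generative PLMs.

```python
# Minimal symmetric per-tensor fake quantization: map weights to signed
# integers with a single scale, then dequantize to measure the approximation.
import torch

def quantize_dequantize(W, num_bits=8):
    """Quantize a weight tensor to signed integers, then dequantize."""
    qmax = 2 ** (num_bits - 1) - 1                         # e.g. 127 for int8
    scale = W.abs().max().clamp(min=1e-8) / qmax
    W_q = torch.clamp(torch.round(W / scale), -qmax - 1, qmax)
    return W_q * scale                                     # dequantized approximation of W
```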

Compressing large language models by joint sparsification and quantization

J Guo, J Wu, Z Wang, J Liu, G Yang, Y Ding… - … on Machine Learning, 2024 - openreview.net
In this paper, we introduce a novel model compression technique named Joint Sparsification
and Quantization (JSQ), explicitly tailored for large language models (LLMs). Traditional …
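
For a mechanical picture of what joint sparsification and quantization combines, a hedged sketch that applies magnitude pruning followed by simple symmetric quantization; JSQ co-optimizes the two, which this naive sequential version does not attempt.

```python
# Naive sequential combination of magnitude pruning and symmetric fake
# quantization, shown only to illustrate the two operations JSQ combines.
import torch

def sparsify_then_quantize(W, sparsity=0.5, num_bits=8):
    """Zero out the smallest-magnitude weights, then fake-quantize the survivors."""
    k = int(W.numel() * sparsity)                          # number of weights to prune
    if k > 0:
        threshold = W.abs().flatten().kthvalue(k).values
        W = torch.where(W.abs() > threshold, W, torch.zeros_like(W))
    qmax = 2 ** (num_bits - 1) - 1
    scale = W.abs().max().clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(W / scale), -qmax - 1, qmax) * scale
```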