Meta-KD: A meta knowledge distillation framework for language model compression across domains

H Pan, C Wang, M Qiu, Y Zhang, Y Li… - arXiv preprint arXiv …, 2020 - arxiv.org
Pre-trained language models have been applied to various NLP tasks with considerable
performance gains. However, the large model sizes, together with the long inference time …

One teacher is enough? Pre-trained language model distillation from multiple teachers

C Wu, F Wu, Y Huang - arXiv preprint arXiv:2106.01023, 2021 - arxiv.org
Pre-trained language models (PLMs) achieve great success in NLP. However, their huge
model sizes hinder their applications in many practical systems. Knowledge distillation is a …
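A minimal sketch of the general multi-teacher idea, assuming the teachers' softened output distributions are simply averaged before the usual distillation objective; the stand-in linear models, temperature, and loss weighting below are illustrative, not the paper's setup:

```python
# Minimal multi-teacher knowledge distillation sketch (not the paper's exact method).
# Teachers and student are stand-in linear classifiers; in practice they would be
# pre-trained transformer encoders. All names and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, labels,
                          temperature=2.0, alpha=0.5):
    """Cross-entropy on gold labels + KL to the averaged teacher distribution."""
    # Average the teachers' softened probability distributions.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        teacher_probs,
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kd

# Toy usage: three frozen "teachers", one trainable "student".
torch.manual_seed(0)
num_features, num_classes, batch = 16, 4, 8
teachers = [nn.Linear(num_features, num_classes) for _ in range(3)]
student = nn.Linear(num_features, num_classes)
x = torch.randn(batch, num_features)
y = torch.randint(0, num_classes, (batch,))

with torch.no_grad():
    teacher_logits = [t(x) for t in teachers]
loss = multi_teacher_kd_loss(student(x), teacher_logits, y)
loss.backward()
print(float(loss))
```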

Knowledge distillation with Reptile meta-learning for pretrained language model compression

X Ma, J Wang, LC Yu, X Zhang - Proceedings of the 29th …, 2022 - aclanthology.org
The billions, and sometimes even trillions, of parameters involved in pre-trained language
models significantly hamper their deployment in resource-constrained devices and real-time …

Less is more: Task-aware layer-wise distillation for language model compression

C Liang, S Zuo, Q Zhang, P He… - … on Machine Learning, 2023 - proceedings.mlr.press
Layer-wise distillation is a powerful tool to compress large models (i.e., teacher models) into
small ones (i.e., student models). The student distills knowledge from the teacher by …
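A plain layer-wise distillation sketch for context, assuming a fixed student-to-teacher layer mapping, uniform layer weights, and equal hidden sizes; the paper's task-aware weighting is not reproduced here:

```python
# Plain layer-wise distillation sketch: match selected teacher hidden states to
# student hidden states with an MSE loss. The paper's task-aware weighting is
# replaced by uniform layer weights; hidden sizes are assumed equal.
import torch
import torch.nn.functional as F

def layerwise_distill_loss(student_hiddens, teacher_hiddens, layer_map):
    """student_hiddens/teacher_hiddens: lists of [batch, seq, hidden] tensors.
    layer_map: pairs (student_layer_idx, teacher_layer_idx) to align."""
    losses = [
        F.mse_loss(student_hiddens[s], teacher_hiddens[t])
        for s, t in layer_map
    ]
    return torch.stack(losses).mean()

# Toy usage: a 4-layer student distilling from a 12-layer teacher,
# aligning student layer i with teacher layer 3*(i+1) - 1.
batch, seq, hidden = 2, 8, 32
teacher_hiddens = [torch.randn(batch, seq, hidden) for _ in range(12)]
student_hiddens = [torch.randn(batch, seq, hidden, requires_grad=True) for _ in range(4)]
layer_map = [(i, 3 * (i + 1) - 1) for i in range(4)]  # [(0,2),(1,5),(2,8),(3,11)]
loss = layerwise_distill_loss(student_hiddens, teacher_hiddens, layer_map)
loss.backward()
print(float(loss))
```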

Contrastive distillation on intermediate representations for language model compression

S Sun, Z Gan, Y Cheng, Y Fang, S Wang… - arXiv preprint arXiv …, 2020 - arxiv.org
Existing language model compression methods mostly use a simple L2 loss to distill
knowledge in the intermediate representations of a large BERT model to a smaller one …
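To illustrate the contrast the abstract draws, a sketch comparing the simple L2 objective with an InfoNCE-style contrastive loss on pooled intermediate representations; the in-batch negative sampling and temperature below are assumptions, not the paper's exact objective:

```python
# Sketch contrasting a simple L2 objective with an InfoNCE-style contrastive
# loss on pooled intermediate representations. Illustration of the general
# idea only; hyperparameters and pooling are assumed.
import torch
import torch.nn.functional as F

def l2_distill_loss(student_repr, teacher_repr):
    return F.mse_loss(student_repr, teacher_repr)

def contrastive_distill_loss(student_repr, teacher_repr, temperature=0.1):
    """Treat (student_i, teacher_i) as a positive pair and every other
    teacher representation in the batch as a negative."""
    s = F.normalize(student_repr, dim=-1)
    t = F.normalize(teacher_repr, dim=-1)
    logits = s @ t.T / temperature      # [batch, batch] similarity matrix
    targets = torch.arange(s.size(0))   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage on pooled [CLS]-like vectors of matching width.
batch, hidden = 8, 64
student_repr = torch.randn(batch, hidden, requires_grad=True)
teacher_repr = torch.randn(batch, hidden)
print(float(l2_distill_loss(student_repr, teacher_repr)),
      float(contrastive_distill_loss(student_repr, teacher_repr)))
```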

MixKD: Towards efficient distillation of large-scale language models

KJ Liang, W Hao, D Shen, Y Zhou, W Chen… - arXiv preprint arXiv …, 2020 - arxiv.org
Large-scale language models have recently demonstrated impressive empirical
performance. Nevertheless, the improved results are attained at the price of bigger models …

Extreme language model compression with optimal subwords and shared projections

S Zhao, R Gupta, Y Song, D Zhou - 2019 - openreview.net
Pre-trained deep neural network language models such as ELMo, GPT, BERT and XLNet
have recently achieved state-of-the-art performance on a variety of language understanding …

Feature Alignment-Based Knowledge Distillation for Efficient Compression of Large Language Models

S Wang, C Wang, J Gao, Z Qi, H Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
This study proposes a knowledge distillation algorithm based on large language models
and feature alignment, aiming to effectively transfer the knowledge of large pre-trained …
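A minimal feature-alignment sketch, assuming the student's hidden size differs from the teacher's and a learned linear projection bridges the gap before an MSE alignment loss; the module and dimensions below are hypothetical, not taken from the paper:

```python
# Minimal feature-alignment sketch: a learned linear projection maps student
# features into the teacher's space before the alignment loss is computed.
# Dimensions here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAligner(nn.Module):
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_feat, teacher_feat):
        # Project student features, then penalize the distance to the teacher's.
        return F.mse_loss(self.proj(student_feat), teacher_feat)

# Toy usage: 312-dim student features aligned to 768-dim teacher features.
aligner = FeatureAligner(student_dim=312, teacher_dim=768)
student_feat = torch.randn(4, 128, 312)   # [batch, seq, student_dim]
teacher_feat = torch.randn(4, 128, 768)   # [batch, seq, teacher_dim]
loss = aligner(student_feat, teacher_feat)
loss.backward()
print(float(loss))
```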

Patient knowledge distillation for BERT model compression

S Sun, Y Cheng, Z Gan, J Liu - arXiv preprint arXiv:1908.09355, 2019 - arxiv.org
Pre-trained language models such as BERT have proven to be highly effective for natural
language processing (NLP) tasks. However, the high demand for computing resources in …

A short study on compressing decoder-based language models

T Li, YE Mesbahi, I Kobyzev, A Rashid… - arXiv preprint arXiv …, 2021 - arxiv.org
Pre-trained Language Models (PLMs) have been successful for a wide range of natural
language processing (NLP) tasks. State-of-the-art PLMs, however, are extremely …