Patient knowledge distillation for BERT model compression

S Sun, Y Cheng, Z Gan, J Liu - arXiv preprint arXiv:1908.09355, 2019 - arxiv.org
Pre-trained language models such as BERT have proven to be highly effective for natural
language processing (NLP) tasks. However, the high demand for computing resources in …

TinyBERT: Distilling BERT for natural language understanding

X Jiao, Y Yin, L Shang, X Jiang, X Chen, L Li… - arXiv preprint arXiv …, 2019 - arxiv.org
Language model pre-training, such as BERT, has significantly improved the performances of
many natural language processing tasks. However, pre-trained language models are …

BERT-EMD: Many-to-many layer mapping for BERT compression with earth mover's distance

J Li, X Liu, H Zhao, R Xu, M Yang, Y Jin - arXiv preprint arXiv:2010.06133, 2020 - arxiv.org
Pre-trained language models (e.g., BERT) have achieved significant success in various
natural language processing (NLP) tasks. However, high storage and computational costs …

One teacher is enough? Pre-trained language model distillation from multiple teachers

C Wu, F Wu, Y Huang - arXiv preprint arXiv:2106.01023, 2021 - arxiv.org
Pre-trained language models (PLMs) achieve great success in NLP. However, their huge
model sizes hinder their applications in many practical systems. Knowledge distillation is a …

LadaBERT: Lightweight adaptation of BERT through hybrid model compression

Y Mao, Y Wang, C Wu, C Zhang, Y Wang… - arXiv preprint arXiv …, 2020 - arxiv.org
BERT is a cutting-edge language representation model pre-trained by a large corpus, which
achieves superior performances on various natural language understanding tasks …

Meta-KD: A meta knowledge distillation framework for language model compression across domains

H Pan, C Wang, M Qiu, Y Zhang, Y Li… - arXiv preprint arXiv …, 2020 - arxiv.org
Pre-trained language models have been applied to various NLP tasks with considerable
performance gains. However, the large model sizes, together with the long inference time …

HomoDistil: Homotopic task-agnostic distillation of pre-trained transformers

C Liang, H Jiang, Z Li, X Tang, B Yin, T Zhao - arXiv preprint arXiv …, 2023 - arxiv.org
Knowledge distillation has been shown to be a powerful model compression approach to
facilitate the deployment of pre-trained language models in practice. This paper focuses on …

MoEBERT: From BERT to mixture-of-experts via importance-guided adaptation

S Zuo, Q Zhang, C Liang, P He, T Zhao… - arXiv preprint arXiv …, 2022 - arxiv.org
Pre-trained language models have demonstrated superior performance in various natural
language processing tasks. However, these models usually contain hundreds of millions of …

XtremeDistil: Multi-stage distillation for massive multilingual models

S Mukherjee, A Awadallah - arXiv preprint arXiv:2004.05686, 2020 - arxiv.org
Deep and large pre-trained language models are the state-of-the-art for various natural
language processing tasks. However, the huge size of these models could be a deterrent to …

Dynamic knowledge distillation for pre-trained language models

L Li, Y Lin, S Ren, P Li, J Zhou, X Sun - arXiv preprint arXiv:2109.11295, 2021 - arxiv.org
Knowledge distillation (KD) has been proven effective for compressing large-scale
pre-trained language models. However, existing methods conduct KD statically, e.g., the student …
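
The papers listed above differ in how they map teacher layers to student layers, combine multiple teachers, or schedule the distillation process, but all of them build on the standard soft-label distillation objective. The following is a minimal illustrative sketch of that shared objective in PyTorch, not the exact loss of any paper above; the function name `distillation_loss` and the `temperature` and `alpha` values are placeholder assumptions for illustration only.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature=2.0, alpha=0.5):
        """Generic soft-label knowledge distillation loss (illustrative sketch).

        Combines KL divergence between temperature-softened teacher and student
        distributions with ordinary cross-entropy on the gold labels.
        """
        # Soft targets: KL(teacher || student) at temperature T, scaled by T^2
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * (temperature ** 2)

        # Hard targets: standard cross-entropy against the ground-truth labels
        hard_loss = F.cross_entropy(student_logits, labels)

        return alpha * soft_loss + (1.0 - alpha) * hard_loss

    # Toy usage with random logits (batch of 4 examples, 3 classes)
    student = torch.randn(4, 3)
    teacher = torch.randn(4, 3)
    labels = torch.tensor([0, 2, 1, 0])
    loss = distillation_loss(student, teacher, labels)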