Automatic student network search for knowledge distillation

Z Zhang, W Zhu, J Yan, P Gao… - 2020 25th International …, 2021 - ieeexplore.ieee.org
Pre-trained language models (PLMs), such as BERT, have achieved outstanding
performance on multiple natural language processing (NLP) tasks. However, such pre …

Universal-KD: Attention-based output-grounded intermediate layer knowledge distillation

Y Wu, M Rezagholizadeh, A Ghaddar… - Proceedings of the …, 2021 - aclanthology.org
Intermediate layer matching has been shown to be an effective approach for improving knowledge
distillation (KD). However, this technique applies matching in the hidden spaces of two …
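
As background for entries of this kind, a minimal sketch of generic intermediate-layer matching is given below. It is not the attention-based, output-grounded scheme of Universal-KD; the hidden sizes, layer mapping, and projection layers are illustrative assumptions.

import torch.nn as nn

class IntermediateLayerKD(nn.Module):
    """Generic intermediate-layer matching: project selected student hidden
    states to the teacher's hidden size and penalize the MSE against the
    corresponding teacher layers (the layer mapping is an assumption)."""

    def __init__(self, student_dim=384, teacher_dim=768,
                 layer_map=((1, 3), (2, 6), (3, 9))):
        super().__init__()
        self.layer_map = layer_map  # (student_layer, teacher_layer) pairs
        self.proj = nn.ModuleDict({str(s): nn.Linear(student_dim, teacher_dim)
                                   for s, _ in layer_map})
        self.mse = nn.MSELoss()

    def forward(self, student_hidden, teacher_hidden):
        # Both arguments: lists of [batch, seq_len, dim] tensors, one per layer
        # (e.g. the hidden_states returned by a Hugging Face BERT model).
        loss = 0.0
        for s, t in self.layer_map:
            loss = loss + self.mse(self.proj[str(s)](student_hidden[s]),
                                   teacher_hidden[t].detach())
        return loss / len(self.layer_map)

In practice this term is added to the task loss and a logit-level distillation loss with tuned weights.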

SDSK2BERT: Explore the specific depth with specific knowledge to compress BERT

L Ding, Y Yang - 2020 IEEE International Conference on …, 2020 - ieeexplore.ieee.org
The success of pre-trained models like BERT in Natural Language Processing (NLP) has created a
demand for model compression. Previous works adopting knowledge distillation …

Reinforced multi-teacher selection for knowledge distillation

F Yuan, L Shou, J Pei, W Lin, M Gong, Y Fu… - Proceedings of the AAAI …, 2021 - ojs.aaai.org
In natural language processing (NLP) tasks, slow inference speed and huge GPU memory footprints
remain the bottleneck for applying pre-trained deep models in production. As a …
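
The paper itself learns a reinforcement-learning policy for picking teachers; purely as an illustration of per-sample teacher selection, the greedy stand-in below picks, for each example, the teacher whose logits give the lowest cross-entropy on that example (the selection criterion and tensor shapes are assumptions, not the method of the cited work).

import torch
import torch.nn.functional as F

def select_teacher_logits(teacher_logits_list, labels):
    # teacher_logits_list: list of [batch, num_classes] tensors, one per teacher.
    # Returns, per sample, the logits of the teacher with the lowest CE on that
    # sample (a greedy heuristic stand-in for the RL selection policy).
    losses = torch.stack([F.cross_entropy(t, labels, reduction="none")
                          for t in teacher_logits_list], dim=0)   # [n_teachers, batch]
    best = losses.argmin(dim=0)                                   # [batch]
    stacked = torch.stack(teacher_logits_list, dim=0)             # [n_teachers, batch, C]
    batch_idx = torch.arange(labels.size(0), device=labels.device)
    return stacked[best, batch_idx]                               # [batch, C]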

Revisiting intermediate layer distillation for compressing language models: An overfitting perspective

J Ko, S Park, M Jeong, S Hong, E Ahn… - arXiv preprint arXiv …, 2023 - arxiv.org
Knowledge distillation (KD) is a highly promising method for mitigating the computational
problems of pre-trained language models (PLMs). Among various KD approaches …

Which student is best? A comprehensive knowledge distillation exam for task-specific BERT models

MN Nityasya, HA Wibowo, R Chevi, RE Prasojo… - arXiv preprint arXiv …, 2022 - arxiv.org
We perform a knowledge distillation (KD) benchmark from task-specific BERT-base teacher
models to various student models: BiLSTM, CNN, BERT-Tiny, BERT-Mini, and BERT-Small …

RW-KD: Sample-wise loss terms re-weighting for knowledge distillation

P Lu, A Ghaddar, A Rashid… - Findings of the …, 2021 - aclanthology.org
Knowledge Distillation (KD) is extensively used in Natural Language Processing to
compress the pre-training and task-specific fine-tuning phases of large neural language …
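
The mechanism named in the title can be pictured as a per-sample weighted combination of the cross-entropy and distillation terms; the sketch below assumes the weights are given (in RW-KD they are produced by a learned re-weighting procedure, which is omitted here).

import torch.nn.functional as F

def sample_weighted_kd_loss(student_logits, teacher_logits, labels, weights, T=2.0):
    # weights: [batch] tensor in [0, 1]; how it is produced is outside this sketch.
    ce = F.cross_entropy(student_logits, labels, reduction="none")         # [batch]
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="none").sum(dim=-1) * (T * T)                  # [batch]
    return ((1.0 - weights) * ce + weights * kd).mean()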

Adversarial data augmentation for task-specific knowledge distillation of pre-trained transformers

M Zhang, NU Naresh, Y He - Proceedings of the AAAI Conference on …, 2022 - ojs.aaai.org
Deep and large pre-trained language models (e.g., BERT, GPT-3) are state-of-the-art for
various natural language processing tasks. However, the huge size of these models brings …
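
A common way to realize adversarial data augmentation for distillation is a one-step gradient perturbation of the input embeddings; the FGSM-style sketch below assumes a Hugging Face-style model that accepts inputs_embeds and is not the exact procedure of the cited paper.

import torch

def adversarial_embeddings(model, embeds, labels, loss_fn, epsilon=1e-2):
    # One-step (FGSM-style) perturbation of the input embeddings, used as
    # extra training examples during distillation. Illustrative only.
    embeds = embeds.detach().requires_grad_(True)
    loss = loss_fn(model(inputs_embeds=embeds).logits, labels)
    grad, = torch.autograd.grad(loss, embeds)
    return (embeds + epsilon * grad.sign()).detach()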

How to select one among all? An extensive empirical study towards the robustness of knowledge distillation in natural language understanding

T Li, A Rashid, A Jafari, P Sharma, A Ghodsi… - arXiv preprint arXiv …, 2021 - arxiv.org
Knowledge Distillation (KD) is a model compression algorithm that helps transfer the
knowledge of a large neural network into a smaller one. Even though KD has shown …

Relay knowledge distillation for efficiently boosting the performance of shallow networks

S Fu, Z Lai, Y Zhang, Y Liu, X Yang - Neurocomputing, 2022 - Elsevier
To reduce the computational cost and memory footprint of powerful deep neural
networks for applications on edge devices, many model compression methods have been …