Pre-trained models for natural language processing: A survey

X Qiu, T Sun, Y Xu, Y Shao, N Dai, X Huang - Science China …, 2020 - Springer
Recently, the emergence of pre-trained models (PTMs) has brought natural language
processing (NLP) to a new era. In this survey, we provide a comprehensive review of PTMs …

A survey of knowledge distillation research

黄震华, 杨顺志, 林威, 倪娟, 孙圣力, 陈运文, 汤庸 - 计算机学报 (Chinese Journal of Computers), 2022 - 159.226.43.17
High-performance deep learning networks are typically compute- and parameter-intensive, making them
difficult to deploy on resource-constrained edge devices. To run deep learning models on low-resource devices, efficient small-scale networks need to be developed …
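
The distillation methods covered by such surveys generally build on the classic soft-target objective of Hinton et al. (2015). Below is a minimal sketch of that loss in PyTorch; the temperature and weighting values are illustrative only, not taken from the survey.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature=4.0, alpha=0.5):
        # Soft-target term: KL between the temperature-softened teacher and
        # student distributions, scaled by T^2 (Hinton et al., 2015).
        soft = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
        # Hard-target term: ordinary cross-entropy on the ground-truth labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard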

MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers

W Wang, F Wei, L Dong, H Bao… - Advances in Neural …, 2020 - proceedings.neurips.cc
Pre-trained language models (e.g., BERT (Devlin et al., 2018) and its variants) have achieved
remarkable success in a variety of NLP tasks. However, these models usually consist of …
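
MiniLM transfers the last-layer self-attention distributions of the teacher to the student; the sketch below shows only that attention-distribution term (the paper's full objective also matches value relations), assuming attention maps of shape [batch, heads, seq, seq] that are already softmax-normalized.

    import torch
    import torch.nn.functional as F

    def attention_transfer_loss(student_attn, teacher_attn, eps=1e-12):
        # student_attn, teacher_attn: [batch, heads, seq_len, seq_len]
        # attention probabilities from the last self-attention layer.
        # KL(teacher || student), averaged over all attending positions.
        kl = F.kl_div(
            (student_attn + eps).log(),
            teacher_attn,
            reduction="none",
        ).sum(-1)          # sum over the attended-to dimension
        return kl.mean()   # average over batch, heads, and query positions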

MiniLMv2: Multi-head self-attention relation distillation for compressing pretrained transformers

W Wang, H Bao, S Huang, L Dong, F Wei - arXiv preprint arXiv …, 2020 - arxiv.org
We generalize deep self-attention distillation in MiniLM (Wang et al., 2020) by only using self-
attention relation distillation for task-agnostic compression of pretrained Transformers. In …
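
MiniLMv2 replaces attention-map transfer with self-attention relations: scaled dot-product similarities computed separately among queries, among keys, and among values, then matched between teacher and student via KL divergence. A hedged sketch of that idea follows; the choice of layer and number of relation heads is omitted.

    import math
    import torch
    import torch.nn.functional as F

    def self_attention_relation(x):
        # x: [batch, heads, seq_len, head_dim] (queries, keys, or values).
        # Pairwise scaled dot-product relation, normalized with softmax.
        d = x.size(-1)
        return F.softmax(x @ x.transpose(-1, -2) / math.sqrt(d), dim=-1)

    def relation_distillation_loss(student_qkv, teacher_qkv, eps=1e-12):
        # student_qkv / teacher_qkv: tuples of (Q, K, V) tensors.
        loss = 0.0
        for s, t in zip(student_qkv, teacher_qkv):
            rs, rt = self_attention_relation(s), self_attention_relation(t)
            loss = loss + F.kl_div((rs + eps).log(), rt, reduction="batchmean")
        return loss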

DynaBERT: Dynamic BERT with adaptive width and depth

L Hou, Z Huang, L Shang, X Jiang… - Advances in Neural …, 2020 - proceedings.neurips.cc
Pre-trained language models like BERT, though powerful in many natural language
processing tasks, are expensive in both computation and memory. To alleviate this problem …
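
DynaBERT trains a single model whose width (number of attention heads and FFN neurons) and depth can be adjusted at inference time, after first rewiring heads so the most important ones come first. The snippet below only illustrates a width mask for attention heads; it is not the paper's training procedure.

    import torch

    def head_mask(num_heads, width_mult):
        # Keep the first round(num_heads * width_mult) heads, assuming heads
        # have already been sorted by importance (DynaBERT's "rewiring" step).
        kept = max(1, int(round(num_heads * width_mult)))
        mask = torch.zeros(num_heads)
        mask[:kept] = 1.0
        return mask

    # e.g. a 0.5x-width sub-network of a 12-head model keeps 6 heads:
    # head_mask(12, 0.5) -> tensor([1., 1., 1., 1., 1., 1., 0., ..., 0.])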

DNNFusion: Accelerating deep neural networks execution with advanced operator fusion

W Niu, J Guan, Y Wang, G Agrawal, B Ren - Proceedings of the 42nd …, 2021 - dl.acm.org
Deep Neural Networks (DNNs) have emerged as the core enabler of many major
applications on mobile devices. To achieve high accuracy, DNN models have become …

SqueezeBERT: What can computer vision teach NLP about efficient neural networks?

FN Iandola, AE Shaw, R Krishna… - arXiv preprint arXiv …, 2020 - arxiv.org
Humans read and write hundreds of billions of messages every day. Further, due to the
availability of large datasets, large computing systems, and better neural network models …

On the effect of dropping layers of pre-trained transformer models

H Sajjad, F Dalvi, N Durrani, P Nakov - Computer Speech & Language, 2023 - Elsevier
Transformer-based NLP models are trained using hundreds of millions or even billions of
parameters, limiting their applicability in computationally constrained environments. While …
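
Sajjad et al. study several strategies for removing layers from a pretrained transformer before fine-tuning, of which top-layer dropping is the simplest. A minimal sketch using the Hugging Face transformers API is given below; the model name and number of kept layers are illustrative.

    import torch.nn as nn
    from transformers import AutoModel

    def keep_bottom_layers(model_name="bert-base-uncased", keep=6):
        # Load the pretrained encoder and discard its top layers, one of the
        # layer-dropping strategies examined in the paper.
        model = AutoModel.from_pretrained(model_name)
        model.encoder.layer = nn.ModuleList(model.encoder.layer[:keep])
        model.config.num_hidden_layers = keep
        return model  # fine-tune this smaller model as usual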

Pre-trained embeddings for entity resolution: an experimental analysis

A Zeakis, G Papadakis, D Skoutas… - Proceedings of the VLDB …, 2023 - dl.acm.org
Many recent works on Entity Resolution (ER) leverage Deep Learning techniques involving
language models to improve effectiveness. This is applied to both main steps of ER, i.e., …

A survey of knowledge distillation in deep learning

邵仁荣, 刘宇昂, 张伟, 王骏 - 计算机学报 (Chinese Journal of Computers), 2022 - 159.226.43.17
With the rapid development of artificial intelligence today, deep neural networks are widely applied across research fields and have achieved great success,
but they also face many challenges. First, to solve complex problems and improve model training performance …