Model compression and hardware acceleration for neural networks: A comprehensive survey

L Deng, G Li, S Han, L Shi, Y Xie - Proceedings of the IEEE, 2020 - ieeexplore.ieee.org
Domain-specific hardware is becoming a promising direction as performance improvements
in general-purpose processors slow down due to the foreseeable end of Moore's Law …

A comprehensive survey on model compression and acceleration

T Choudhary, V Mishra, A Goswami… - Artificial Intelligence …, 2020 - Springer
In recent years, machine learning (ML) and deep learning (DL) have brought remarkable
improvements to computer vision, natural language processing, stock prediction, forecasting …
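
As a concrete taste of the techniques such surveys catalog, here is a minimal sketch of magnitude-based weight pruning, one of the simplest compression methods; the function name and the 90% sparsity level are illustrative choices, not taken from the paper.

    import numpy as np

    def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
        """Zero out the given fraction of smallest-magnitude weights (sketch)."""
        k = int(sparsity * weights.size)
        if k == 0:
            return weights.copy()
        # Threshold at the k-th smallest absolute value.
        threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
        return weights * (np.abs(weights) > threshold)

    w = np.random.randn(256, 256).astype(np.float32)
    w_sparse = magnitude_prune(w, sparsity=0.9)  # keeps roughly 10% of weights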

A survey of quantization methods for efficient neural network inference

A Gholami, S Kim, Z Dong, Z Yao… - Low-Power Computer …, 2022 - taylorfrancis.com
This chapter surveys approaches to the problem of quantizing the numerical values in deep
neural network computations, covering the advantages and disadvantages of current methods …
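
A minimal sketch of uniform affine quantization, the baseline scheme such surveys analyze and compare against; the helper names are mine, and a real implementation would also guard against a zero range.

    import numpy as np

    def quantize_affine(x: np.ndarray, num_bits: int = 8):
        """Map floats to unsigned integers via a scale and a zero point (sketch)."""
        qmin, qmax = 0, 2 ** num_bits - 1
        scale = (x.max() - x.min()) / (qmax - qmin)
        zero_point = int(round(qmin - x.min() / scale))
        q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
        return q, scale, zero_point

    def dequantize_affine(q, scale, zero_point):
        return scale * (q.astype(np.float32) - zero_point)

    x = np.random.randn(1024).astype(np.float32)
    q, s, z = quantize_affine(x)
    max_err = np.abs(x - dequantize_affine(q, s, z)).max()  # bounded by ~scale/2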

Q-BERT: Hessian based ultra low precision quantization of BERT

S Shen, Z Dong, J Ye, L Ma, Z Yao, A Gholami… - Proceedings of the AAAI …, 2020 - aaai.org
Transformer-based architectures have become the de facto models used for a range of Natural
Language Processing tasks. In particular, BERT-based models achieved significant …
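
Q-BERT's central idea is to rank layers by second-order (Hessian) sensitivity and give more sensitive layers higher bit widths. A toy sketch of that assignment step, with made-up sensitivity scores standing in for the Hessian spectrum estimates the paper obtains via power iteration:

    # Hypothetical per-layer sensitivities (e.g., top Hessian eigenvalue
    # estimates); the estimation step itself is omitted here.
    sensitivities = {"layer0": 12.4, "layer1": 0.8, "layer2": 3.1, "layer3": 0.2}

    def assign_bits(sens: dict, budget=(2, 2, 4, 8)) -> dict:
        """Give the lowest bit widths to the least sensitive layers (sketch)."""
        ordered = sorted(sens, key=sens.get)  # least sensitive first
        return dict(zip(ordered, sorted(budget)))

    print(assign_bits(sensitivities))
    # {'layer3': 2, 'layer1': 2, 'layer2': 4, 'layer0': 8}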

Efficient acceleration of deep learning inference on resource-constrained edge devices: A review

MMH Shuvo, SK Islam, J Cheng… - Proceedings of the …, 2022 - ieeexplore.ieee.org
Successful integration of deep neural networks (DNNs) or deep learning (DL) has resulted
in breakthroughs in many areas. However, deploying these highly accurate models for data …

ESPNetv2: A light-weight, power efficient, and general purpose convolutional neural network

S Mehta, M Rastegari, L Shapiro… - Proceedings of the …, 2019 - openaccess.thecvf.com
We introduce a light-weight, power-efficient, and general-purpose convolutional neural
network, ESPNetv2, for modeling visual and sequential data. Our network uses group point …
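
The ingredients the abstract names, grouped point-wise and depth-wise dilated separable convolutions, can be sketched in PyTorch as below. The actual ESPNetv2 unit is more elaborate, so this only illustrates why the combination is cheap: the grouped 1x1 convolution mixes channels at reduced cost, and the depth-wise dilated convolution enlarges the receptive field with one filter per channel.

    import torch
    import torch.nn as nn

    class SeparableBlock(nn.Module):
        """Grouped point-wise + depth-wise dilated convolution (illustrative)."""
        def __init__(self, channels: int, groups: int = 4, dilation: int = 2):
            super().__init__()
            # Grouped 1x1 conv: channel mixing within groups, fewer parameters.
            self.pointwise = nn.Conv2d(channels, channels, 1, groups=groups)
            # Depth-wise dilated 3x3 conv: per-channel spatial filtering.
            self.depthwise = nn.Conv2d(channels, channels, 3, padding=dilation,
                                       dilation=dilation, groups=channels)

        def forward(self, x):
            return self.depthwise(self.pointwise(x))

    y = SeparableBlock(32)(torch.randn(1, 32, 56, 56))  # shape is preserved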

LUT-GEMM: Quantized matrix multiplication based on LUTs for efficient inference in large-scale generative language models

G Park, B Park, M Kim, S Lee, J Kim, B Kwon… - arXiv preprint arXiv …, 2022 - arxiv.org
The recent advancements in self-supervised learning, combined with the Transformer
architecture, have enabled natural language processing (NLP) to achieve remarkably low …
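
The trick behind LUT-based GEMM: once weights are binary-coded, every length-mu slice of the activation vector admits only 2^mu possible signed partial sums, which can be precomputed once and then indexed by the weight bit patterns. A toy single-bit-component version follows; the paper's kernel handles multi-bit binary-coding quantization with scale factors and runs on GPUs, so this NumPy sketch only demonstrates the lookup idea.

    import numpy as np

    def lut_matvec(B: np.ndarray, x: np.ndarray, mu: int = 8) -> np.ndarray:
        """y = B @ x for B in {-1,+1}, using table lookups instead of multiplies."""
        m, n = B.shape
        assert n % mu == 0
        # All 2**mu signed partial-sum patterns for a length-mu slice.
        signs = np.array([[1 if (p >> j) & 1 else -1 for j in range(mu)]
                          for p in range(2 ** mu)], dtype=x.dtype)
        y = np.zeros(m, dtype=x.dtype)
        for g in range(n // mu):
            xg = x[g * mu:(g + 1) * mu]
            lut = signs @ xg  # 2**mu partial sums, shared by all output rows
            idx = ((B[:, g * mu:(g + 1) * mu] > 0)
                   * (1 << np.arange(mu))).sum(axis=1)  # bit pattern per row
            y += lut[idx]
        return y

    rng = np.random.default_rng(0)
    B = rng.choice([-1, 1], size=(64, 128)).astype(np.float32)
    x = rng.standard_normal(128, dtype=np.float32)
    assert np.allclose(lut_matvec(B, x), B @ x, atol=1e-3)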

Understanding and overcoming the challenges of efficient transformer quantization

Y Bondarenko, M Nagel, T Blankevoort - arXiv preprint arXiv:2109.12948, 2021 - arxiv.org
Transformer-based architectures have become the de facto standard models for a wide
range of Natural Language Processing tasks. However, their memory footprint and high …
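
One concrete difficulty this line of work examines is that transformer activations contain large outliers, which blow up the per-tensor quantization range and waste resolution on the bulk of the values. A toy numeric illustration (my example, not the paper's):

    import numpy as np

    def int8_roundtrip(x: np.ndarray) -> np.ndarray:
        """Symmetric per-tensor INT8 quantize-dequantize (sketch)."""
        scale = np.abs(x).max() / 127.0
        return np.round(x / scale).clip(-127, 127) * scale

    x = np.random.randn(4096).astype(np.float32)
    x_out = x.copy()
    x_out[0] = 60.0  # a single large activation outlier

    print(np.abs(x - int8_roundtrip(x)).mean())          # small error
    print(np.abs(x_out - int8_roundtrip(x_out)).mean())  # over 10x larger error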

A survey on methods and theories of quantized neural networks

Y Guo - arXiv preprint arXiv:1808.04752, 2018 - arxiv.org
Deep neural networks are the state-of-the-art methods for many real-world tasks, such as
computer vision, natural language processing and speech recognition. For all its popularity …

Compression of deep learning models for text: A survey

M Gupta, P Agrawal - ACM Transactions on Knowledge Discovery from …, 2022 - dl.acm.org
In recent years, the fields of natural language processing (NLP) and information retrieval (IR)
have made tremendous progress thanks to deep learning models like Recurrent Neural …