Model compression and hardware acceleration for neural networks: A comprehensive survey

L Deng, G Li, S Han, L Shi, Y Xie - Proceedings of the IEEE, 2020 - ieeexplore.ieee.org
Domain-specific hardware is becoming a promising topic against the backdrop of slowing
improvements in general-purpose processors due to the foreseeable end of Moore's Law …

Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks

T Hoefler, D Alistarh, T Ben-Nun, N Dryden… - Journal of Machine …, 2021 - jmlr.org
The growing energy and performance costs of deep learning have driven the community to
reduce the size of neural networks by selectively pruning components. Similarly to their …

SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training

E Qin, A Samajdar, H Kwon, V Nadella… - … Symposium on High …, 2020 - ieeexplore.ieee.org
The advent of Deep Learning (DL) has radically transformed the computing industry across
the entire spectrum from algorithms to circuits. As myriad application domains embrace DL, it …

SPINN: synergistic progressive inference of neural networks over device and cloud

S Laskaridis, SI Venieris, M Almeida… - Proceedings of the 26th …, 2020 - dl.acm.org
Despite the soaring use of convolutional neural networks (CNNs) in mobile applications,
uniformly sustaining high-performance inference on mobile has been elusive due to the …

Machine learning at Facebook: Understanding inference at the edge

CJ Wu, D Brooks, K Chen, D Chen… - … symposium on high …, 2019 - ieeexplore.ieee.org
At Facebook, machine learning provides a wide range of capabilities that drive many
aspects of user experience including ranking posts, content understanding, object detection …

Inducing and exploiting activation sparsity for fast inference on deep neural networks

M Kurtz, J Kopinsky, R Gelashvili… - International …, 2020 - proceedings.mlr.press
Optimizing convolutional neural networks for fast inference has recently become an
extremely active area of research. One of the go-to solutions in this context is weight …

RecNMP: Accelerating personalized recommendation with near-memory processing

L Ke, U Gupta, BY Cho, D Brooks… - 2020 ACM/IEEE 47th …, 2020 - ieeexplore.ieee.org
Personalized recommendation systems leverage deep learning models and account for the
majority of data center AI cycles. Their performance is dominated by memory-bound sparse …

Resource-efficient convolutional networks: A survey on model-, arithmetic-, and implementation-level techniques

JK Lee, L Mukhanov, AS Molahosseini… - ACM Computing …, 2023 - dl.acm.org
Convolutional neural networks (CNNs) are used in our daily life, including self-driving cars,
virtual assistants, social network services, healthcare services, and face recognition, among …

TensorDIMM: A practical near-memory processing architecture for embeddings and tensor operations in deep learning

Y Kwon, Y Lee, M Rhu - Proceedings of the 52nd Annual IEEE/ACM …, 2019 - dl.acm.org
Recent studies from several hyperscalers point to embedding layers as the most memory-
intensive deep learning (DL) algorithm being deployed in today's datacenters. This paper …

The lazy neuron phenomenon: On emergence of activation sparsity in transformers

Z Li, C You, S Bhojanapalli, D Li, AS Rawat… - arXiv preprint arXiv …, 2022 - arxiv.org
This paper studies the curious phenomenon that machine learning models with Transformer
architectures have sparse activation maps. By activation map we refer to the …