In-network machine learning using programmable network devices: A survey

C Zheng, X Hong, D Ding, S Vargaftik… - … Surveys & Tutorials, 2023 - ieeexplore.ieee.org
Machine learning is widely used to solve networking challenges, ranging from traffic
classification and anomaly detection to network configuration. However, machine learning …
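
A recurring approach surveyed in this line of work is compiling trained models into the match-action tables of programmable switches. As a purely illustrative sketch (feature names, thresholds, and labels below are assumptions, not taken from the survey), a small decision tree over header-derived features can be flattened into per-feature range rules of the kind a data plane could install:

```python
FEATURES = ["pkt_len", "proto"]  # hypothetical header-derived features

# Internal nodes: (feature, threshold, left_subtree, right_subtree); leaves: class label.
TREE = ("pkt_len", 128,
        ("proto", 6, "small_tcp", "small_udp"),
        "bulk_transfer")

def tree_to_rules(node, bounds=None):
    """Flatten a decision tree into one range-match rule per leaf."""
    bounds = bounds or {f: [0, 2**16 - 1] for f in FEATURES}
    if isinstance(node, str):                        # leaf: emit a (match, action) rule
        return [({f: tuple(b) for f, b in bounds.items()}, node)]
    feat, thr, left, right = node
    left_b = {f: list(b) for f, b in bounds.items()}
    right_b = {f: list(b) for f, b in bounds.items()}
    left_b[feat][1] = min(left_b[feat][1], thr)       # left branch: feature <= thr
    right_b[feat][0] = max(right_b[feat][0], thr + 1)  # right branch: feature > thr
    return tree_to_rules(left, left_b) + tree_to_rules(right, right_b)

for match, action in tree_to_rules(TREE):
    print(match, "->", action)
```

Each leaf becomes one rule, so tree depth and leaf count bound the table sizes such a device would need.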

Hardware-assisted machine learning in resource-constrained IoT environments for security: review and future prospective

G Kornaros - IEEE Access, 2022 - ieeexplore.ieee.org
As the Internet of Things (IoT) technology advances, billions of multidisciplinary smart
devices act in concert, rarely requiring human intervention, posing significant challenges in …

8-bit optimizers via block-wise quantization

T Dettmers, M Lewis, S Shleifer… - arXiv preprint arXiv …, 2021 - arxiv.org
Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed
sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can …
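
As a rough numpy sketch of the block-wise idea: the state tensor is split into fixed-size blocks, each block is scaled by its own absmax, and values are stored as 8-bit codes. The block size and the linear code map below are simplifying assumptions; the paper's dynamic (non-linear) quantization map is not reproduced here.

```python
import numpy as np

BLOCK = 2048                      # assumed block size

def quantize_blockwise(state: np.ndarray, block: int = BLOCK):
    flat = state.ravel().astype(np.float32)
    pad = (-len(flat)) % block
    flat = np.pad(flat, (0, pad))
    blocks = flat.reshape(-1, block)
    absmax = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12  # per-block scale
    codes = np.round(blocks / absmax * 127).astype(np.int8)     # 8-bit codes
    return codes, absmax, state.shape, pad

def dequantize_blockwise(codes, absmax, shape, pad):
    blocks = codes.astype(np.float32) / 127 * absmax
    flat = blocks.ravel()
    flat = flat[:len(flat) - pad] if pad else flat
    return flat.reshape(shape)

# Example: an Adam-style second moment (squared-gradient EMA) held in 8 bits between steps.
rng = np.random.default_rng(0)
v = rng.random((1000, 1000)).astype(np.float32) * 1e-3
codes, absmax, shape, pad = quantize_blockwise(v)
v_hat = dequantize_blockwise(codes, absmax, shape, pad)
print("max abs error:", np.abs(v - v_hat).max())
```

Keeping one scale per block rather than per tensor limits how far a single outlier can degrade the precision of the rest of the state.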

Scaling distributed machine learning with in-network aggregation

A Sapio, M Canini, CY Ho, J Nelson, P Kalnis… - … USENIX Symposium on …, 2021 - usenix.org
Training machine learning models in parallel is an increasingly important workload. We
accelerate distributed parallel training by designing a communication primitive that uses a …
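
A host-side sketch of the arithmetic such an aggregation primitive performs (chunk size, scaling factor, and all streaming, packetization, and loss-recovery details are omitted assumptions): workers convert gradient chunks to fixed-point integers, an aggregator sums them slot by slot, and workers rescale the result.

```python
import numpy as np

CHUNK = 256          # assumed number of elements aggregated per packet/slot
SCALE = 2 ** 16      # assumed fixed-point scaling factor

def to_fixed(grad):
    return np.round(grad * SCALE).astype(np.int64)

def from_fixed(acc, n_workers):
    return acc.astype(np.float64) / SCALE / n_workers   # mean of worker gradients

rng = np.random.default_rng(1)
n_workers, dim = 4, 1024
grads = [rng.normal(size=dim) * 1e-2 for _ in range(n_workers)]

aggregated = np.zeros(dim, dtype=np.int64)
for start in range(0, dim, CHUNK):                 # one "packet" per chunk
    sl = slice(start, start + CHUNK)
    for g in grads:                                # aggregator adds each worker's chunk
        aggregated[sl] += to_fixed(g[sl])

mean_grad = from_fixed(aggregated, n_workers)
print("max error vs float mean:", np.abs(mean_grad - np.mean(grads, axis=0)).max())
```

Integer accumulation is the key constraint in this sketch: a switch pipeline sums fixed-point values, so the floating-point conversion happens only at the workers.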

FloatPIM: In-memory acceleration of deep neural network training with high precision

M Imani, S Gupta, Y Kim, T Rosing - Proceedings of the 46th International …, 2019 - dl.acm.org
Processing In-Memory (PIM) has shown great potential to accelerate inference tasks of
Convolutional Neural Networks (CNNs). However, existing PIM architectures do not support …

Communication compression for decentralized training

H Tang, S Gan, C Zhang, T Zhang… - Advances in Neural …, 2018 - proceedings.neurips.cc
Optimizing distributed learning systems is an art of balancing computation and
communication. There have been two lines of research that try to deal with slower …
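
As a toy sketch of the setting only: nodes on a ring each hold a model copy, take local gradient steps, and exchange a compressed view of their parameters with neighbors for averaging. The 1-bit sign compressor below is an illustrative stand-in and does not reproduce the convergence-preserving schemes studied in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n_nodes, dim, steps, lr = 8, 32, 50, 0.1
target = rng.normal(size=dim)                       # optimum of a shared quadratic objective
models = [rng.normal(size=dim) for _ in range(n_nodes)]

def compress(x):
    scale = np.mean(np.abs(x))                      # one float per message
    return np.sign(x), scale                        # plus one bit per coordinate

def decompress(signs, scale):
    return signs * scale

for _ in range(steps):
    # Local gradient step on f(x) = 0.5 * ||x - target||^2 (same objective at every node here).
    models = [x - lr * (x - target) for x in models]
    # Exchange compressed parameters with ring neighbors and average.
    views = [decompress(*compress(x)) for x in models]
    models = [
        (models[i] + views[(i - 1) % n_nodes] + views[(i + 1) % n_nodes]) / 3
        for i in range(n_nodes)
    ]

print("mean distance to optimum:", np.mean([np.linalg.norm(x - target) for x in models]))
```

The residual error in the printout is exactly the tension the abstract describes: naive compression saves communication but perturbs the gossip averaging step.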

Pushing the limits of narrow precision inferencing at cloud scale with Microsoft floating point

B Darvish Rouhani, D Lo, R Zhao… - Advances in neural …, 2020 - proceedings.neurips.cc
In this paper, we explore the limits of Microsoft Floating Point (MSFP), a new class of
datatypes developed for production cloud-scale inferencing on custom hardware. Through …
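
At a high level MSFP is a block floating-point format: a bounding box of values shares one exponent while each element keeps a short private mantissa. The box size and mantissa width in this sketch are illustrative assumptions, not the production MSFP parameters.

```python
import numpy as np

BOX = 16          # assumed bounding-box (block) size
MANT_BITS = 4     # assumed mantissa magnitude bits per element (plus a sign)

def to_block_fp(x):
    x = x.reshape(-1, BOX)
    max_abs = np.abs(x).max(axis=1, keepdims=True) + 1e-38
    shared_exp = np.ceil(np.log2(max_abs))             # one exponent per box
    step = 2.0 ** (shared_exp - MANT_BITS)              # quantization step inside the box
    mant = np.clip(np.round(x / step), -2 ** MANT_BITS, 2 ** MANT_BITS - 1)
    return mant.astype(np.int8), shared_exp

def from_block_fp(mant, shared_exp, shape):
    step = 2.0 ** (shared_exp - MANT_BITS)
    return (mant.astype(np.float32) * step).reshape(shape)

rng = np.random.default_rng(3)
w = rng.normal(size=(4, 64)).astype(np.float32)
mant, exp = to_block_fp(w)
w_hat = from_block_fp(mant, exp, w.shape)
print("relative error:", np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```

Sharing the exponent amortizes its storage across the box, which is what lets the per-element payload shrink to a few mantissa bits.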

Stable and low-precision training for large-scale vision-language models

M Wortsman, T Dettmers… - Advances in …, 2023 - proceedings.neurips.cc
We introduce new methods for 1) accelerating and 2) stabilizing training for large language-
vision models. 1) For acceleration, we introduce SwitchBack, a linear layer for int8 quantized …
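
A generic sketch of the int8 linear-layer building block such work relies on: absmax quantization scales, an integer matmul, and a floating-point rescale of the output. Which passes SwitchBack itself keeps in int8 versus higher precision is not reproduced here.

```python
import numpy as np

def absmax_quantize(x, axis):
    scale = np.abs(x).max(axis=axis, keepdims=True) / 127 + 1e-12
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_linear(x, w):
    xq, xs = absmax_quantize(x, axis=1)                # per-row scale for activations
    wq, ws = absmax_quantize(w, axis=0)                # per-column scale for weights
    acc = xq.astype(np.int32) @ wq.astype(np.int32)    # integer matmul
    return acc.astype(np.float32) * xs * ws            # rescale back to float

rng = np.random.default_rng(4)
x = rng.normal(size=(8, 512)).astype(np.float32)
w = rng.normal(size=(512, 256)).astype(np.float32)
out_fp = x @ w
out_q = int8_linear(x, w)
print("relative error:", np.linalg.norm(out_fp - out_q) / np.linalg.norm(out_fp))
```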

Dynamic aggregation for heterogeneous quantization in federated learning

S Chen, C Shen, L Zhang… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Communication is widely known as the primary bottleneck of federated learning, and
quantization of local model updates before uploading to the parameter server is an effective …
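
A sketch of the setting: clients quantize their local updates at different bit-widths and the server forms a weighted average. The precision-proportional weights below are an illustrative assumption, not the dynamic aggregation rule derived in the paper.

```python
import numpy as np

def quantize(update, bits):
    levels = 2 ** (bits - 1) - 1                      # symmetric uniform quantizer
    scale = np.abs(update).max() / levels + 1e-12
    return np.round(update / scale) * scale           # dequantized immediately for clarity

rng = np.random.default_rng(5)
dim = 1000
true_update = rng.normal(size=dim)
client_bits = [2, 4, 8, 8]                            # heterogeneous upload precisions

# Each client sees a slightly noisy version of the update and quantizes it before upload.
uploads = [quantize(true_update + 0.05 * rng.normal(size=dim), b) for b in client_bits]

# Illustrative weighting: finer quantizers get proportionally more weight.
weights = np.array([2.0 ** b for b in client_bits])
weights /= weights.sum()

weighted = sum(w * u for w, u in zip(weights, uploads))
uniform = sum(uploads) / len(uploads)
print("weighted-average error:", np.linalg.norm(weighted - true_update))
print("uniform-average error: ", np.linalg.norm(uniform - true_update))
```

The gap between the two printed errors illustrates why uniform averaging is a poor fit once clients quantize at very different precisions.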

TensorDash: Exploiting sparsity to accelerate deep neural network training

M Mahmoud, I Edo, AH Zadeh… - 2020 53rd Annual …, 2020 - ieeexplore.ieee.org
TensorDash is a hardware-based technique that enables data-parallel multiply-accumulate (MAC) units to take
advantage of sparsity in their input operand streams. When used to compose a hardware …
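
As a software analogy only: the sketch below counts how many multiply-accumulate operations in a dot-product stream can be skipped because either operand is zero, which is the opportunity such hardware exploits. The scheduling network that lets idle MAC lanes pick up useful work from nearby stream positions is not modeled.

```python
import numpy as np

rng = np.random.default_rng(6)
activations = rng.normal(size=100_000)
weights = rng.normal(size=100_000)
activations[rng.random(100_000) < 0.6] = 0.0   # e.g. post-ReLU activation sparsity
weights[rng.random(100_000) < 0.3] = 0.0       # pruned or naturally sparse weights

useful = np.count_nonzero((activations != 0) & (weights != 0))
total = activations.size
print(f"useful MACs: {useful}/{total} ({useful / total:.1%})")
print(f"ideal speedup if zero-operand MACs are skipped: {total / useful:.2f}x")
print("dot product unchanged:", np.dot(activations, weights))
```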