G Kornaros - IEEE Access, 2022 - ieeexplore.ieee.org
As Internet of Things (IoT) technology advances, billions of multidisciplinary smart devices act in concert, rarely requiring human intervention, posing significant challenges in …
Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can …
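As a rough illustration of the optimizer state this snippet refers to, the sketch below keeps an exponentially smoothed gradient sum (SGD with momentum) and a smoothed squared-gradient sum (Adam's second moment). The function names, hyperparameters, and toy update loop are illustrative only, not taken from the cited paper.

# Illustrative sketch of per-parameter optimizer state (not from the cited paper).
import numpy as np

def momentum_step(param, grad, state, lr=1e-2, beta=0.9):
    # Exponentially smoothed sum of past gradients (SGD with momentum).
    state["m"] = beta * state["m"] + grad
    return param - lr * state["m"]

def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam additionally tracks the smoothed sum of squared gradients.
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)

param = np.zeros(4)
state = {"m": np.zeros(4), "v": np.zeros(4), "t": 0}
grad = np.ones(4)
param = adam_step(param, grad, state)  # state persists across steps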
Training machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a …
Processing In-Memory (PIM) has shown great potential to accelerate inference tasks of Convolutional Neural Networks (CNNs). However, existing PIM architectures do not support …
Optimizing distributed learning systems is an art of balancing computation and communication. There have been two lines of research that try to deal with slower …
In this paper, we explore the limits of Microsoft Floating Point (MSFP), a new class of datatypes developed for production cloud-scale inferencing on custom hardware. Through …
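The snippet stops before describing the format itself; MSFP belongs to the block floating point family, in which a group of values shares one exponent while each value keeps its own sign and a narrow mantissa. The sketch below is a generic shared-exponent quantizer along those lines, an assumed illustration of that family rather than the exact MSFP encoding.

# Generic block floating point sketch (an assumed illustration of the
# shared-exponent idea behind MSFP-style datatypes, not the exact format).
import numpy as np

def bfp_quantize(block, mantissa_bits=4):
    max_val = np.abs(block).max()
    if max_val == 0:
        return np.zeros_like(block)
    # Shared power-of-two scale chosen so the largest value fits in the mantissa range.
    shared_exp = np.ceil(np.log2(max_val / (2 ** (mantissa_bits - 1) - 1)))
    scale = 2.0 ** shared_exp
    mantissas = np.clip(np.round(block / scale),
                        -(2 ** (mantissa_bits - 1)), 2 ** (mantissa_bits - 1) - 1)
    return mantissas * scale  # all elements share one exponent (the scale)

x = np.random.randn(16).astype(np.float32)
print(np.abs(bfp_quantize(x) - x).max())  # error bounded by the shared scale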
We introduce new methods for 1) accelerating and 2) stabilizing training for large language-vision models. 1) For acceleration, we introduce SwitchBack, a linear layer for int8 quantized …
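To make "a linear layer for int8 quantized" matmuls concrete, the sketch below shows generic row-wise absmax int8 quantization of activations and weights, an int8 matrix multiply with int32 accumulation, and dequantization of the result. This is a generic illustration of int8 quantized linear layers, not the SwitchBack design itself.

# Generic int8-quantized linear layer sketch (not the SwitchBack layer itself).
import numpy as np

def quantize_rowwise(x):
    # Per-row absmax scaling into the int8 range [-127, 127].
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_linear(x, w):
    # x: (batch, in_features), w: (out_features, in_features)
    xq, xs = quantize_rowwise(x)
    wq, ws = quantize_rowwise(w)
    acc = xq.astype(np.int32) @ wq.astype(np.int32).T  # int8 matmul, int32 accumulate
    return acc.astype(np.float32) * xs * ws.T           # rescale back to float

x = np.random.randn(2, 8).astype(np.float32)
w = np.random.randn(4, 8).astype(np.float32)
print(np.abs(int8_linear(x, w) - x @ w.T).max())  # small quantization error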
S Chen, C Shen, L Zhang… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Communication is widely known as the primary bottleneck of federated learning, and quantization of local model updates before uploading to the parameter server is an effective …
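The sketch below illustrates the idea of quantizing a local model update before uploading it to the parameter server, using a generic unbiased stochastic uniform quantizer; the bit width, helper names, and rounding scheme are illustrative assumptions, not necessarily the scheme analyzed in the cited paper.

# Minimal sketch: quantize a local model update before upload
# (generic stochastic uniform quantizer, not the cited paper's exact scheme).
import numpy as np

def quantize_update(delta, num_bits=4):
    levels = 2 ** num_bits - 1
    lo, hi = delta.min(), delta.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Stochastic rounding keeps the quantizer unbiased in expectation.
    normalized = (delta - lo) / scale
    q = np.floor(normalized + np.random.rand(*delta.shape)).astype(np.uint8)
    return q, lo, scale  # only q plus two scalars need to be transmitted

def dequantize_update(q, lo, scale):
    return q.astype(np.float32) * scale + lo

delta = np.random.randn(1000).astype(np.float32) * 0.01  # local model update
q, lo, scale = quantize_update(delta)
recovered = dequantize_update(q, lo, scale)
print(np.abs(recovered - delta).max())  # quantization error per coordinate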
M Mahmoud, I Edo, AH Zadeh… - 2020 53rd Annual …, 2020 - ieeexplore.ieee.org
TensorDash is a hardware-based technique that enables data-parallel MAC units to take advantage of sparsity in their input operand streams. When used to compose a hardware …