Offloading machine learning to programmable data planes: A systematic survey

R Parizotto, BL Coelho, DC Nunes, I Haque… - ACM Computing …, 2023 - dl.acm.org
The demand for machine learning (ML) has increased significantly in recent decades,
enabling several applications, such as speech recognition, computer vision, and …

Overlap communication with dependent computation via decomposition in large deep learning models

S Wang, J Wei, A Sabne, A Davis, B Ilbeyi… - Proceedings of the 28th …, 2022 - dl.acm.org
Large deep learning models have shown great potential with state-of-the-art results in many
tasks. However, running these large models is quite challenging on an accelerator (GPU or …

Galvatron: Efficient transformer training over multiple GPUs using automatic parallelism

X Miao, Y Wang, Y Jiang, C Shi, X Nie, H Zhang… - arXiv preprint arXiv …, 2022 - arxiv.org
Transformer models have achieved state-of-the-art performance on various domains of
applications and have gradually become the foundation of advanced large deep learning …

ASTRA-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale

W Won, T Heo, S Rashidi, S Sridharan… - … Analysis of Systems …, 2023 - ieeexplore.ieee.org
As deep learning models and input data continue to scale at an unprecedented rate, it has
become inevitable to move towards distributed training platforms to fit the models and …

Congestion control in machine learning clusters

S Rajasekaran, M Ghobadi, G Kumar… - Proceedings of the 21st …, 2022 - dl.acm.org
This paper argues that fair-sharing, the holy grail of congestion control algorithms for
decades, is not necessarily a desirable property in Machine Learning (ML) training clusters …

HammingMesh: a network topology for large-scale deep learning

T Hoefler, T Bonato, D De Sensi… - … Conference for High …, 2022 - ieeexplore.ieee.org
Numerous microarchitectural optimizations unlocked tremendous processing power for
deep neural networks that in turn fueled the AI revolution. With the exhaustion of such …

Themis: A network bandwidth-aware collective scheduling policy for distributed training of DL models

S Rashidi, W Won, S Srinivasan, S Sridharan… - Proceedings of the 49th …, 2022 - dl.acm.org
Distributed training is a solution to reduce DNN training time by splitting the task across
multiple NPUs (e.g., GPUs/TPUs). However, distributed training adds communication overhead …

Peta-scale embedded photonics architecture for distributed deep learning applications

Z Wu, LY Dai, A Novick, M Glick, Z Zhu… - Journal of Lightwave …, 2023 - ieeexplore.ieee.org
As Deep Learning (DL) models grow larger and more complex, training jobs are
increasingly distributed across multiple Computing Units (CUs) such as GPUs and TPUs …

CoDG-ReRAM: An algorithm-hardware co-design to accelerate semi-structured GNNs on ReRAM

Y Luo, P Behnam, K Thorat, Z Liu… - 2022 IEEE 40th …, 2022 - ieeexplore.ieee.org
Graph Neural Networks (GNNs) have attracted wide attention and are being applied in the real
world. However, due to ever-growing graph data with significant irregularities, off-chip …

Logical/physical topology-aware collective communication in deep learning training

S Cho, H Son, J Kim - 2023 IEEE International Symposium on …, 2023 - ieeexplore.ieee.org
Training is an essential step in deep learning that enables network models to be deployed.
To scale training, multiple GPUs are commonly used with data parallelism to exploit the …