ATP: In-network aggregation for multi-tenant learning

CL Lao, Y Le, K Mahajan, Y Chen, W Wu… - … USENIX Symposium on …, 2021 - usenix.org
Distributed deep neural network training (DT) systems are widely deployed in clusters where
the network is shared across multiple tenants, i.e., multiple DT jobs. Each DT job computes …

Scaling distributed machine learning with in-network aggregation

A Sapio, M Canini, CY Ho, J Nelson, P Kalnis… - … USENIX Symposium on …, 2021 - usenix.org
Training machine learning models in parallel is an increasingly important workload. We
accelerate distributed parallel training by designing a communication primitive that uses a …
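
A minimal, self-contained sketch of the general in-network aggregation idea behind this line of work (not the paper's actual switch program, which runs on programmable switch ASICs): a software "switch" sums gradient chunks streamed by the workers and multicasts the aggregate back, so each worker sends and receives one chunk instead of exchanging full gradients pairwise. All function and variable names here are illustrative.

```python
# Toy illustration of in-network gradient aggregation (conceptual only).
import numpy as np

def switch_aggregate(chunks):
    """Sum one gradient chunk from every worker, as an in-network switch would."""
    return np.sum(chunks, axis=0)

def allreduce_via_switch(worker_grads, chunk_size):
    """Workers stream fixed-size chunks to the 'switch'; the switch returns
    the element-wise sum, which every worker adopts."""
    n = worker_grads[0].size
    out = [np.empty_like(g) for g in worker_grads]
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        agg = switch_aggregate([g[start:end] for g in worker_grads])
        for o in out:
            o[start:end] = agg          # switch multicasts the aggregate back
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.standard_normal(10) for _ in range(4)]   # 4 workers
    reduced = allreduce_via_switch(grads, chunk_size=3)
    assert np.allclose(reduced[0], np.sum(grads, axis=0))
```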

TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs

W Wang, M Khazraee, Z Zhong, M Ghobadi… - … USENIX Symposium on …, 2023 - usenix.org
We propose TopoOpt, a novel direct-connect fabric for deep neural network (DNN) training
workloads. TopoOpt co-optimizes the distributed training process across three dimensions …

A network-centric hardware/algorithm co-design to accelerate distributed training of deep neural networks

Y Li, J Park, M Alian, Y Yuan, Z Qu… - 2018 51st Annual …, 2018 - ieeexplore.ieee.org
Training real-world Deep Neural Networks (DNNs) can take an eon (i.e., weeks or months)
without leveraging distributed systems. Even distributed training takes inordinate time, of …

Project adam: Building an efficient and scalable deep learning training system

T Chilimbi, Y Suzue, J Apacible… - 11th USENIX symposium …, 2014 - usenix.org
Large deep neural network models have recently demonstrated state-of-the-art accuracy on
hard visual recognition tasks. Unfortunately, such models are extremely time consuming to …

Alpa: Automating inter- and intra-operator parallelism for distributed deep learning

L Zheng, Z Li, H Zhang, Y Zhuang, Z Chen… - … USENIX Symposium on …, 2022 - usenix.org
Alpa automates model-parallel training of large deep learning (DL) models by generating
execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel …
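
As a rough illustration of what unifying data and intra-operator parallelism means for a single operator (a hand-written sketch, not Alpa's planner or API): the same matmul can be sharded along the batch dimension (data parallelism) or along the weight's output dimension (intra-operator parallelism), and a planner chooses such a layout per operator. The function names below are assumptions made for the example.

```python
# Two parallelism layouts for one matmul Y = X @ W, simulated on one machine.
import numpy as np

def data_parallel_matmul(X, W, n_devices):
    """Shard the batch dimension of X; every 'device' holds a full copy of W."""
    shards = np.array_split(X, n_devices, axis=0)
    return np.concatenate([x @ W for x in shards], axis=0)

def operator_parallel_matmul(X, W, n_devices):
    """Shard W along its output dimension; every 'device' sees the full X."""
    shards = np.array_split(W, n_devices, axis=1)
    return np.concatenate([X @ w for w in shards], axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, W = rng.standard_normal((8, 16)), rng.standard_normal((16, 4))
    ref = X @ W
    assert np.allclose(data_parallel_matmul(X, W, 2), ref)
    assert np.allclose(operator_parallel_matmul(X, W, 2), ref)
```

Which layout is cheaper depends on the communication each one induces, which is exactly the trade-off an automated planner searches over.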

An in-network architecture for accelerating shared-memory multiprocessor collectives

B Klenk, N Jiang, G Thorson… - 2020 ACM/IEEE 47th …, 2020 - ieeexplore.ieee.org
The slowdown of single-chip performance scaling combined with the growing demands of
computing ever larger problems efficiently has led to a renewed interest in distributed …

Enabling compute-communication overlap in distributed deep learning training platforms

S Rashidi, M Denton, S Sridharan… - 2021 ACM/IEEE 48th …, 2021 - ieeexplore.ieee.org
Deep Learning (DL) training platforms are built by interconnecting multiple DL accelerators
(e.g., GPU/TPU) via fast, customized interconnects with hundreds of gigabytes per second (GB/s) of bandwidth …
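
A toy sketch of the compute-communication overlap this line of work studies (illustrative only; real platforms launch asynchronous collectives on accelerators, not Python threads): as soon as one layer's gradient is ready, its all-reduce is issued in the background while the gradient of the next layer is still being computed. The helper names and sleep-based "work" are placeholders.

```python
# Toy overlap of gradient communication with backward compute.
import time
from concurrent.futures import ThreadPoolExecutor

def backward_layer(layer):
    time.sleep(0.05)                 # stand-in for computing this layer's gradient
    return f"grad[{layer}]"

def all_reduce(grad):
    time.sleep(0.05)                 # stand-in for communicating the gradient
    return f"reduced {grad}"

def backward_with_overlap(n_layers):
    pending = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        for layer in reversed(range(n_layers)):
            grad = backward_layer(layer)                   # compute
            pending.append(comm.submit(all_reduce, grad))  # communicate in background
        return [f.result() for f in pending]

if __name__ == "__main__":
    start = time.time()
    print(backward_with_overlap(4))
    print(f"elapsed ~{time.time() - start:.2f}s (vs ~0.40s fully serialized)")
```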

Understanding data storage and ingestion for large-scale deep recommendation model training: Industrial product

M Zhao, N Agarwal, A Basant, B Gedik, S Pan… - Proceedings of the 49th …, 2022 - dl.acm.org
Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators
(DSA) are used to train increasingly complex deep learning models. These clusters rely on a …

In-network aggregation for shared machine learning clusters

N Gebara, M Ghobadi, P Costa - Proceedings of Machine …, 2021 - proceedings.mlsys.org
We present PANAMA, a network architecture for machine learning (ML) workloads on
shared clusters where a variety of training jobs co-exist. PANAMA consists of two key …