Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings

N Jouppi, G Kurian, S Li, P Ma, R Nagarajan… - Proceedings of the 50th …, 2023 - dl.acm.org
In response to innovations in machine learning (ML) models, production workloads changed
radically and rapidly. TPU v4 is the fifth Google domain specific architecture (DSA) and its …

A survey on scheduling techniques in computing and network convergence

S Tang, Y Yu, H Wang, G Wang, W Chen… - … Surveys & Tutorials, 2023 - ieeexplore.ieee.org
The computing demand for massive applications has led to the ubiquitous deployment of
computing power. This trend results in the urgent need for higher-level computing resource …

{CASSINI}:{Network-Aware} Job Scheduling in Machine Learning Clusters

S Rajasekaran, M Ghobadi, A Akella - 21st USENIX Symposium on …, 2024 - usenix.org
We present CASSINI, a network-aware job scheduler for machine learning (ML) clusters.
CASSINI introduces a novel geometric abstraction to consider the communication pattern of …

Lightwave fabrics: at-scale optical circuit switching for datacenter and machine learning systems

H Liu, R Urata, K Yasumura, X Zhou… - Proceedings of the …, 2023 - dl.acm.org
We describe our experience developing what we believe to be the world's first large-scale
production deployments of lightwave fabrics used for both datacenter networking and …

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey

F Liang, Z Zhang, H Lu, V Leung, Y Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
With the rapid growth in the volume of data sets, models, and devices in the domain of deep
learning, there is increasing attention on large-scale distributed deep learning. In contrast to …

Congestion control in machine learning clusters

S Rajasekaran, M Ghobadi, G Kumar… - Proceedings of the 21st …, 2022 - dl.acm.org
This paper argues that fair-sharing, the holy grail of congestion control algorithms for
decades, is not necessarily a desirable property in Machine Learning (ML) training clusters …

Credence: Augmenting Datacenter Switch Buffer Sharing with {ML} Predictions

V Addanki, M Pacut, S Schmid - 21st USENIX Symposium on Networked …, 2024 - usenix.org
Packet buffers in datacenter switches are shared across all the switch ports in order to
improve the overall throughput. The trend of shrinking buffer sizes in datacenter switches …

Peta-scale embedded photonics architecture for distributed deep learning applications

Z Wu, LY Dai, A Novick, M Glick, Z Zhu… - Journal of Lightwave …, 2023 - ieeexplore.ieee.org
As Deep Learning (DL) models grow larger and more complex, training jobs are
increasingly distributed across multiple Computing Units (CU) such as GPUs and TPUs …

A Holistic View of AI-driven Network Incident Management

P Hamadanian, B Arzani, S Fouladi… - Proceedings of the …, 2023 - dl.acm.org
We discuss the potential improvement large language models (LLM) can provide in incident
management and how they can overhaul the ways operators conduct incident management …

Swing: Short-cutting Rings for Higher Bandwidth Allreduce

D De Sensi, T Bonato, D Saam, T Hoefler - 21st USENIX Symposium on …, 2024 - usenix.org
The allreduce collective operation accounts for a significant fraction of the runtime of
workloads running on distributed systems. One factor determining its performance is the …