- Academic resource search

TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings

N Jouppi, G Kurian, S Li, P Ma, R Nagarajan… - Proceedings of the 50th …, 2023 - dl.acm.org

In response to innovations in machine learning (ML) models, production workloads changed
radically and rapidly. TPU v4 is the fifth Google domain specific architecture (DSA) and its …

Cited by 203 · Related articles · All 6 versions

A survey on scheduling techniques in computing and network convergence

S Tang, Y Yu, H Wang, G Wang, W Chen… - … Surveys & Tutorials, 2023 - ieeexplore.ieee.org

The computing demand for massive applications has led to the ubiquitous deployment of
computing power. This trend results in the urgent need for higher-level computing resource …

Cited by 4 · Related articles · All 2 versions

[PDF] usenix.org

CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters

S Rajasekaran, M Ghobadi, A Akella - 21st USENIX Symposium on …, 2024 - usenix.org

We present CASSINI, a network-aware job scheduler for machine learning (ML) clusters.
CASSINI introduces a novel geometric abstraction to consider the communication pattern of …

Cited by 10 · Related articles · All 5 versions

[PDF] acm.org

Lightwave fabrics: at-scale optical circuit switching for datacenter and machine learning systems

H Liu, R Urata, K Yasumura, X Zhou… - Proceedings of the …, 2023 - dl.acm.org

We describe our experience developing what we believe to be the world's first large-scale
production deployments of lightwave fabrics used for both datacenter networking and …

Cited by 14 · Related articles · All 2 versions

[PDF] arxiv.org

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey

F Liang, Z Zhang, H Lu, V Leung, Y Guo… - arXiv preprint arXiv …, 2024 - arxiv.org

With the rapid growth in the volume of data sets, models, and devices in the domain of deep
learning, there is increasing attention on large-scale distributed deep learning. In contrast to …

Cited by 2 · Related articles · All 2 versions

[PDF] acm.org

Congestion control in machine learning clusters

S Rajasekaran, M Ghobadi, G Kumar… - Proceedings of the 21st …, 2022 - dl.acm.org

This paper argues that fair-sharing, the holy grail of congestion control algorithms for
decades, is not necessarily a desirable property in Machine Learning (ML) training clusters …

Cited by 20 · Related articles · All 4 versions

[PDF] usenix.org

Credence: Augmenting Datacenter Switch Buffer Sharing with ML Predictions

V Addanki, M Pacut, S Schmid - 21st USENIX Symposium on Networked …, 2024 - usenix.org

Packet buffers in datacenter switches are shared across all the switch ports in order to
improve the overall throughput. The trend of shrinking buffer sizes in datacenter switches …

Cited by 4 · Related articles · All 6 versions

[PDF] ieee.org

Peta-scale embedded photonics architecture for distributed deep learning applications

Z Wu, LY Dai, A Novick, M Glick, Z Zhu… - Journal of Lightwave …, 2023 - ieeexplore.ieee.org

As Deep Learning (DL) models grow larger and more complex, training jobs are
increasingly distributed across multiple Computing Units (CU) such as GPUs and TPUs …

Cited by 9 · Related articles · All 8 versions

[PDF] acm.org

A Holistic View of AI-driven Network Incident Management

P Hamadanian, B Arzani, S Fouladi… - Proceedings of the …, 2023 - dl.acm.org

We discuss the potential improvement large language models (LLM) can provide in incident
management and how they can overhaul the ways operators conduct incident management …

Cited by 5 · Related articles · All 4 versions

[PDF] usenix.org

Swing: Short-cutting Rings for Higher Bandwidth Allreduce

D De Sensi, T Bonato, D Saam, T Hoefler - 21st USENIX Symposium on …, 2024 - usenix.org

The allreduce collective operation accounts for a significant fraction of the runtime of
workloads running on distributed systems. One factor determining its performance is the …

Cited by 4 · Related articles · All 5 versions
