Impact of RoCE congestion control policies on distributed training of DNNs

Astra-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale

W Won, T Heo, S Rashidi, S Sridharan… - … Analysis of Systems …, 2023 - ieeexplore.ieee.org

As deep learning models and input data continue to scale at an unprecedented rate, it has
become inevitable to move towards distributed training platforms to fit the models and …

Cited by 22 - Related articles - All 4 versions

Data center Ethernet and remote direct memory access: Issues at hyperscale

T Hoefler, D Roweth, K Underwood, R Alverson… - Computer, 2023 - ieeexplore.ieee.org

Remote direct memory access (RDMA) over converged Ethernet (RoCE) was an attempt to
adopt modern RDMA features into existing Ethernet installations. We revisit RoCE's design …

Cited by 7 - Related articles - All 4 versions

Datacenter Ethernet and RDMA: Issues at hyperscale

T Hoefler, D Roweth, K Underwood, B Alverson… - arXiv preprint arXiv …, 2023 - arxiv.org

We observe that emerging artificial intelligence, high-performance computing, and storage
workloads pose new challenges for large-scale datacenter networking. RDMA over …

Cited by 10 - Related articles - All 2 versions

Practical Packet Deflection in Datacenters

S Abdous, E Sharafzadeh, S Ghorbani - Proceedings of the ACM on …, 2023 - dl.acm.org

Bursts, sudden surges in network utilization, are a significant root cause of packet loss and
high latency in datacenters. Packet deflection, re-routing packets that arrive at a local …

Cited by 1 - Related articles - All 3 versions

TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning

W Won, M Elavazhagan, S Srinivasan, A Durg… - arXiv preprint arXiv …, 2023 - arxiv.org

The surge of artificial intelligence, specifically large language models, has led to a rapid
advent towards the development of large-scale machine learning training clusters …

Cited by 1 - Related articles - All 2 versions

On the Burstiness of Distributed Machine Learning Traffic

N Luangsomboon, F Fazel, J Liebeherr… - arXiv preprint arXiv …, 2023 - arxiv.org

Traffic from distributed training of machine learning (ML) models makes up a large and
growing fraction of the traffic mix in enterprise data centers. While work on distributed ML …

GPU Cluster Scheduling for Network-Sensitive Deep Learning

A Sharma, VM Bhasi, S Singh, G Kesidis… - arXiv preprint arXiv …, 2024 - arxiv.org

We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables
proximity based consolidation of GPU resources based on the DDL jobs' sensitivities to the …

Cited by 1 - Related articles - All 2 versions

FG-PFC: A Fine-Grained PFC Mechanism for Lossless RDMA

S Li, C Wang, Y Zhang, C Ma, L Li… - Journal of Physics …, 2023 - iopscience.iop.org

Remote Direct Memory Access (RDMA) is widely deployed in data centers to
improve the performance, efficiency, and reliability of data centers. Priority-based Flow …
