Congestion detection in lossless networks

Y Zhang, Y Liu, Q Meng, F Ren - … of the 2021 ACM SIGCOMM 2021 …, 2021 - dl.acm.org
Congestion detection is the cornerstone of end-to-end congestion control. Through in-depth
observations and understandings, we reveal that existing congestion detection mechanisms …

Canary: Congestion-aware in-network allreduce using dynamic trees

D De Sensi, EC Molero, S Di Girolamo… - Future Generation …, 2024 - Elsevier
The allreduce operation is an essential building block for many distributed applications,
ranging from the training of deep learning models to scientific computing. In an allreduce …

Traffic generation for benchmarking data centre networks

CWF Parsonson, JL Benjamin, G Zervas - Optical Switching and …, 2022 - Elsevier
Benchmarking is commonly used in research fields, such as computer architecture design
and machine learning, as a powerful paradigm for rigorously assessing, comparing, and …

Revisiting congestion detection in lossless networks

Y Zhang, Q Meng, Y Liu, F Ren - IEEE/ACM Transactions on …, 2023 - ieeexplore.ieee.org
Congestion detection is the cornerstone of end-to-end congestion control. Through in-depth
observations and understandings, we reveal that existing congestion detection mechanisms …

Live forensics for HPC systems: A case study on distributed storage systems

S Jha, S Cui, SS Banerjee, T Xu, J Enos… - … Conference for High …, 2020 - ieeexplore.ieee.org
Large-scale high-performance computing systems frequently experience a wide range of
failure modes, such as reliability failures (e.g., hang or crash), and resource overload-related …

The mystery of the failing jobs: Insights from operational data from two university-wide computing systems

R Kumar, S Jha, A Mahgoub… - 2020 50th Annual …, 2020 - ieeexplore.ieee.org
Node downtime and failed jobs in a computing cluster translate into wasted resources and
user dissatisfaction. Therefore understanding why nodes and jobs fail in HPC clusters is …

NegotiaToR: Towards A Simple Yet Effective On-demand Reconfigurable Datacenter Network

C Liang, X Song, J Cheng, M Wang, Y Liu… - Proceedings of the …, 2024 - dl.acm.org
Recent advances in fast optical switching technology show promise in meeting the high
goodput and low latency requirements of datacenter networks (DCN). We present …

Evaluating hardware memory disaggregation under delay and contention

A Patke, H Qiu, S Jha, S Venugopal… - 2022 IEEE …, 2022 - ieeexplore.ieee.org
Hardware memory disaggregation is an emerging trend in datacenters that provides access
to remote memory as part of a shared pool or unused memory on machines across the …

A study of network congestion in two supercomputing high-speed interconnects

S Jha, A Patke, J Brandt, A Gentile… - … IEEE Symposium on …, 2019 - ieeexplore.ieee.org
Network congestion in high-speed interconnects is a major source of application runtime
performance variation. Recent years have witnessed a surge of interest from both academia …

Delay sensitivity-driven congestion mitigation for HPC systems

A Patke, S Jha, H Qiu, J Brandt, A Gentile… - Proceedings of the …, 2021 - dl.acm.org
Modern high-performance computing (HPC) systems concurrently execute multiple
distributed applications that contend for the high-speed network leading to congestion …