The allreduce operation is an essential building block for many distributed applications, ranging from the training of deep learning models to scientific computing. In an allreduce …
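For context, an allreduce combines a value from every participant (e.g., an element-wise sum) and leaves every participant holding the identical reduced result. A minimal sketch over plain Python lists, not tied to any particular paper or MPI/NCCL implementation:

```python
# Minimal illustration of an allreduce: every rank contributes a vector,
# and every rank ends up with the element-wise sum of all contributions.
# This is a sketch over plain Python lists, not a real collective library.

def allreduce(vectors):
    """Given one input vector per rank, return the list of per-rank
    outputs after the collective: each rank holds the summed vector."""
    reduced = [sum(vals) for vals in zip(*vectors)]
    return [list(reduced) for _ in vectors]  # each rank gets its own copy

ranks = [[1, 2], [3, 4], [5, 6]]  # per-rank input vectors
print(allreduce(ranks))  # every rank holds [9, 12]
```

Real implementations (e.g., ring or tree allreduce) achieve the same result while exchanging only chunks of the vector between neighbors, which is what makes the operation bandwidth-efficient at scale.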
Benchmarking is commonly used in research fields such as computer architecture design and machine learning as a powerful paradigm for rigorously assessing, comparing, and …
Y Zhang, Q Meng, Y Liu, F Ren - IEEE/ACM Transactions on …, 2023 - ieeexplore.ieee.org
Congestion detection is the cornerstone of end-to-end congestion control. Through in-depth observation and understanding, we reveal that existing congestion detection mechanisms …
Large-scale high-performance computing systems frequently experience a wide range of failure modes, such as reliability failures (e.g., hangs or crashes) and resource overload-related …
Node downtime and failed jobs in a computing cluster translate into wasted resources and user dissatisfaction. Therefore, understanding why nodes and jobs fail in HPC clusters is …
C Liang, X Song, J Cheng, M Wang, Y Liu… - Proceedings of the …, 2024 - dl.acm.org
Recent advances in fast optical switching technology show promise in meeting the high goodput and low latency requirements of datacenter networks (DCN). We present …
A Patke, H Qiu, S Jha, S Venugopal… - 2022 IEEE …, 2022 - ieeexplore.ieee.org
Hardware memory disaggregation is an emerging trend in datacenters that provides access to remote memory as part of a shared pool or unused memory on machines across the …
S Jha, A Patke, J Brandt, A Gentile… - … IEEE Symposium on …, 2019 - ieeexplore.ieee.org
Network congestion in high-speed interconnects is a major source of application runtime performance variation. Recent years have witnessed a surge of interest from both academia …
Modern high-performance computing (HPC) systems concurrently execute multiple distributed applications that contend for the high-speed network, leading to congestion …