The allreduce operation is an essential building block for many distributed applications, ranging from the training of deep learning models to scientific computing. In an allreduce …
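For context, an allreduce combines a value from every participant (e.g., an element-wise sum) and leaves every participant holding the identical reduced result. A minimal sketch over plain Python lists, not tied to any particular paper or MPI/NCCL implementation:

```python
# Minimal illustration of an allreduce: every rank contributes a vector,
# and every rank ends up with the element-wise sum of all contributions.
# This is a sketch over plain Python lists, not a real collective library.

def allreduce(vectors):
    """Given one input vector per rank, return the list of per-rank
    outputs after the collective: each rank holds the summed vector."""
    reduced = [sum(vals) for vals in zip(*vectors)]
    return [list(reduced) for _ in vectors]  # each rank gets its own copy

ranks = [[1, 2], [3, 4], [5, 6]]  # per-rank input vectors
print(allreduce(ranks))  # every rank holds [9, 12]
```

Real implementations (e.g., ring or tree allreduce) achieve the same result while exchanging only chunks of the vector between neighbors, which is what makes the operation bandwidth-efficient at scale.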
Benchmarking is commonly used in research fields such as computer architecture design and machine learning as a powerful paradigm for rigorously assessing, comparing, and …
Y Zhang, Q Meng, Y Liu, F Ren - IEEE/ACM Transactions on …, 2023 - ieeexplore.ieee.org
Congestion detection is the cornerstone of end-to-end congestion control. Through in-depth observation and understanding, we reveal that existing congestion detection mechanisms …
Large-scale high-performance computing systems frequently experience a wide range of failure modes, such as reliability failures (e.g., hangs or crashes) and resource overload-related …
Node downtime and failed jobs in a computing cluster translate into wasted resources and user dissatisfaction. Therefore, understanding why nodes and jobs fail in HPC clusters is …
C Liang, X Song, J Cheng, M Wang, Y Liu… - Proceedings of the …, 2024 - dl.acm.org
Recent advances in fast optical switching technology show promise in meeting the high goodput and low latency requirements of datacenter networks (DCN). We present …
A Patke, H Qiu, S Jha, S Venugopal… - 2022 IEEE …, 2022 - ieeexplore.ieee.org
Hardware memory disaggregation is an emerging trend in datacenters that provides access to remote memory as part of a shared pool or unused memory on machines across the …
S Jha, A Patke, J Brandt, A Gentile… - … IEEE Symposium on …, 2019 - ieeexplore.ieee.org
Network congestion in high-speed interconnects is a major source of application runtime performance variation. Recent years have witnessed a surge of interest from both academia …
Modern high-performance computing (HPC) systems concurrently execute multiple distributed applications that contend for the high-speed network, leading to congestion …