Understanding and analyzing interconnect errors and network congestion on a large scale HPC system

M Kumar, S Gupta, T Patel, M Wilder… - 2018 48th Annual …, 2018 - ieeexplore.ieee.org
Today's High Performance Computing (HPC) systems are capable of delivering
performance in the order of petaflops due to the fast computing devices, network …

Quantifying the impact of network congestion on application performance and network metrics

Y Zhang, T Groves, B Cook, NJ Wright… - … on Cluster Computing …, 2020 - ieeexplore.ieee.org
In modern high-performance computing (HPC) systems, network congestion is an important
factor that contributes to performance degradation. However, how network congestion …

Study of interconnect errors, network congestion, and applications characteristics for throttle prediction on a large scale HPC system

M Kumar, S Gupta, T Patel, M Wilder, W Shi… - Journal of Parallel and …, 2021 - Elsevier
Today's High Performance Computing (HPC) systems contain thousands of nodes
which work together to provide performance in the order of petaflops. The performance of …

OpenStack cloud tuning for high performance computing

P Ivanovic, H Richter - … on Cloud Computing and Big Data …, 2018 - ieeexplore.ieee.org
High-performance computing (HPC) is scarcely attempted in clouds because of slow and
inefficient Inter-VM communication on the same server as well as huge latency between …

Analyzing the impact of system reliability events on applications in the Titan supercomputer

RA Ashraf, C Engelmann - 2018 IEEE/ACM 8th Workshop on …, 2018 - ieeexplore.ieee.org
Extreme-scale computing systems employ Reliability, Availability and Serviceability (RAS)
mechanisms and infrastructure to log events from multiple system components. In this paper …

Toward an in-depth analysis of multifidelity high performance computing systems

S Shilpika, B Lusch, M Emani, F Simini… - 2022 22nd IEEE …, 2022 - ieeexplore.ieee.org
To maintain a robust and reliable supercomputing facility, monitoring it and understanding
its hardware system events and behaviors is an essential task. Exascale systems will be …

Workload imbalance in HPC applications: Effect on performance of in-network processing

P Haghi, A Guo, T Geng, A Skjellum… - 2021 IEEE High …, 2021 - ieeexplore.ieee.org
As HPC systems advance to exascale, communication networks are becoming ever more
complex including, e.g., support for in-network processing. While critical in facilitating …

Resiliency of HPC interconnects: A case study of interconnect failures and recovery in Blue Waters

S Jha, V Formicola, C Di Martino… - … on Dependable and …, 2017 - ieeexplore.ieee.org
Availability of the interconnection network in high-performance computing (HPC) systems is
fundamental to sustaining the continuous execution of applications at scale. When failures …

A study of network congestion in two supercomputing high-speed interconnects

S Jha, A Patke, J Brandt, A Gentile… - … IEEE Symposium on …, 2019 - ieeexplore.ieee.org
Network congestion in high-speed interconnects is a major source of application runtime
performance variation. Recent years have witnessed a surge of interest from both academia …

MELA: A visual analytics tool for studying multifidelity HPC system logs

FNU Shilpika, B Lusch, M Emani… - 2019 IEEE/ACM …, 2019 - ieeexplore.ieee.org
To maintain a robust and reliable supercomputing hardware system there is a critical need
to understand various system events, including failures occurring in the system. Toward this …