Related articles - Academic resource search

Understanding and analyzing interconnect errors and network congestion on a large scale HPC system

M Kumar, S Gupta, T Patel, M Wilder… - 2018 48th Annual …, 2018 - ieeexplore.ieee.org

Today's High Performance Computing (HPC) systems are capable of delivering
performance in the order of petaflops due to the fast computing devices, network …

Cited by 17 · Related articles · All 9 versions

Quantifying the impact of network congestion on application performance and network metrics

Y Zhang, T Groves, B Cook, NJ Wright… - … on Cluster Computing …, 2020 - ieeexplore.ieee.org

In modern high-performance computing (HPC) systems, network congestion is an important
factor that contributes to performance degradation. However, how network congestion …

Cited by 14 · Related articles · All 3 versions

Study of interconnect errors, network congestion, and applications characteristics for throttle prediction on a large scale HPC system

M Kumar, S Gupta, T Patel, M Wilder, W Shi… - Journal of Parallel and …, 2021 - Elsevier

Today's High Performance Computing (HPC) systems contain thousands of nodes
which work together to provide performance in the order of petaflops. The performance of …

Cited by 3 · Related articles · All 8 versions

OpenStack cloud tuning for high performance computing

P Ivanovic, H Richter - … on Cloud Computing and Big Data …, 2018 - ieeexplore.ieee.org

High-Performance computing (HPC) is scarcely attempted in clouds because of slow and
inefficient Inter-VM communication on the same server as well as huge latency between …

Cited by 9 · Related articles

Analyzing the impact of system reliability events on applications in the Titan supercomputer

RA Ashraf, C Engelmann - 2018 IEEE/ACM 8th Workshop on …, 2018 - ieeexplore.ieee.org

Extreme-scale computing systems employ Reliability, Availability and Serviceability (RAS)
mechanisms and infrastructure to log events from multiple system components. In this paper …

Cited by 7 · Related articles · All 8 versions

Toward an in-depth analysis of multifidelity high performance computing systems

S Shilpika, B Lusch, M Emani, F Simini… - 2022 22nd IEEE …, 2022 - ieeexplore.ieee.org

To maintain a robust and reliable supercomputing facility, monitoring it and understanding
its hardware system events and behaviors is an essential task. Exascale systems will be …

Cited by 3 · Related articles · All 2 versions

Workload imbalance in HPC applications: Effect on performance of in-network processing

P Haghi, A Guo, T Geng, A Skjellum… - 2021 IEEE High …, 2021 - ieeexplore.ieee.org

As HPC systems advance to exascale, communication networks are becoming ever more
complex including, e.g., support for in-network processing. While critical in facilitating …

Cited by 15 · Related articles

Resiliency of HPC interconnects: A case study of interconnect failures and recovery in Blue Waters

S Jha, V Formicola, C Di Martino… - … on Dependable and …, 2017 - ieeexplore.ieee.org

Availability of the interconnection network in high-performance computing (HPC) systems is
fundamental to sustaining the continuous execution of applications at scale. When failures …

Cited by 21 · Related articles · All 5 versions

A study of network congestion in two supercomputing high-speed interconnects

S Jha, A Patke, J Brandt, A Gentile… - … IEEE Symposium on …, 2019 - ieeexplore.ieee.org

Network congestion in high-speed interconnects is a major source of application runtime
performance variation. Recent years have witnessed a surge of interest from both academia …

Cited by 12 · Related articles · All 9 versions

MELA: A visual analytics tool for studying multifidelity HPC system logs

FNU Shilpika, B Lusch, M Emani… - 2019 IEEE/ACM …, 2019 - ieeexplore.ieee.org

To maintain a robust and reliable supercomputing hardware system there is a critical need
to understand various system events, including failures occurring in the system. Toward this …

Cited by 19 · Related articles · All 4 versions
