Scalable identification of load imbalance in parallel executions using call path profiles

NR Tallent, L Adhianto… - SC'10: Proceedings of …, 2010 - ieeexplore.ieee.org
Applications must scale well to make efficient use of today's class of petascale computers,
which contain hundreds of thousands of processor cores. Inefficiencies that do not even …

Visualizing network traffic to understand the performance of massively parallel simulations

AG Landge, JA Levine, A Bhatele… - … on Visualization and …, 2012 - ieeexplore.ieee.org
The performance of massively parallel applications is often heavily impacted by the cost of
communication among compute nodes. However, determining how to best use the network …

FlipTracker: Understanding natural error resilience in HPC applications

L Guo, D Li, I Laguna, M Schulz - … : International Conference for …, 2018 - ieeexplore.ieee.org
As high-performance computing systems scale in size and computational power, the danger
of silent errors, i.e., errors that can bypass hardware detection mechanisms and impact …

A framework for end-to-end simulation of high-performance computing systems

WE Denzel, J Li, P Walker, Y Jin - Simulation, 2010 - journals.sagepub.com
We present an end-to-end simulation framework that is capable of simulating High-
Performance Computing (HPC) systems with hundreds of thousands of interconnected …

Optimal scheduling of in-situ analysis for large-scale scientific simulations

P Malakar, V Vishwanath, T Munson, C Knight… - Proceedings of the …, 2015 - dl.acm.org
Today's leadership computing facilities have enabled the execution of transformative
simulations at unprecedented scales. However, analyzing the huge amount of output from …

BeeSwarm: enabling parallel scaling performance measurement in continuous integration for HPC applications

J Tronge, J Chen, P Grubel, T Randles… - 2021 36th IEEE/ACM …, 2021 - ieeexplore.ieee.org
Testing is one of the most important steps in software development: it ensures the quality of
software. Continuous Integration (CI) is a widely used testing standard that can report …

Scalable fine-grained call path tracing

NR Tallent, J Mellor-Crummey, M Franco… - Proceedings of the …, 2011 - dl.acm.org
Applications must scale well to make efficient use of even medium-scale parallel systems.
Because scaling problems are often difficult to diagnose, there is a critical need for scalable …

Evaluating similarity-based trace reduction techniques for scalable performance analysis

K Mohror, KL Karavanic - Proceedings of the conference on high …, 2009 - dl.acm.org
Event traces are required to correctly diagnose a number of performance problems that arise
on today's highly parallel systems. Unfortunately, the collection of event traces can produce …

Lessons learned at 208k: towards debugging millions of cores

GL Lee, DH Ahn, DC Arnold… - SC'08: Proceedings …, 2008 - ieeexplore.ieee.org
Petascale systems will present several new challenges to performance and correctness
tools. Such machines may contain millions of cores, requiring that tools use scalable data …

A visual analytics system for optimizing communications in massively parallel applications

T Fujiwara, P Malakar, K Reda… - … IEEE Conference on …, 2017 - ieeexplore.ieee.org
Current and future supercomputers have tens of thousands of compute nodes
interconnected with high-dimensional networks and complex network topologies for …