Implementing software resiliency in HPX for extreme scale computing

N Gupta, JR Mayo, AS Lemoine, H Kaiser - arXiv preprint arXiv …, 2020 - arxiv.org
Exceptions and errors occurring within mission critical applications due to hardware failures
have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware …

Towards distributed software resilience in asynchronous many-task programming models

N Gupta, JR Mayo, AS Lemoine… - 2020 IEEE/ACM 10th …, 2020 - ieeexplore.ieee.org
Exceptions and errors occurring within mission critical applications due to hardware failures
have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware …

Enabling resilience in asynchronous many-task programming models

SR Paul, A Hayashi, N Slattengren, H Kolla… - Euro-Par 2019: Parallel …, 2019 - Springer
Resilience is an imminent issue for next-generation platforms due to projected increases in
soft/transient failures as part of the inherent trade-offs among performance, energy, and …

Resilience-aware resource management for exascale computing systems

D Dauwe, S Pasricha, AA Maciejewski… - IEEE Transactions on …, 2018 - ieeexplore.ieee.org
With the increases in complexity and number of nodes in large-scale high performance
computing (HPC) systems over time, the probability of applications experiencing runtime …

Integrating inter-node communication with a resilient asynchronous many-task runtime system

SR Paul, A Hayashi, M Whitlock, S Bak… - 2020 Workshop on …, 2020 - ieeexplore.ieee.org
Achieving fault tolerance is one of the significant challenges of exascale computing due to
projected increases in soft/transient failures. While past work on software-based resilience …

Towards high performance resilience using performance portable abstractions

N Morales, K Teranishi, B Nicolae, C Trott… - Euro-Par 2021: Parallel …, 2021 - Springer
In the drive towards Exascale, the extreme heterogeneity of supercomputers at all levels
places a major development burden on HPC applications. To this end, performance portable …

Asking the right questions: benchmarking fault-tolerant extreme-scale systems

PM Widener, KB Ferreira, S Levy, PG Bridges… - Euro-Par 2013: Parallel …, 2014 - Springer
Much recent research has explored fault-tolerance mechanisms intended for current and
future extreme-scale systems. Evaluations of the suitability of checkpoint-based solutions …

Measuring and understanding extreme-scale application resilience: A field study of 5,000,000 HPC application runs

C Di Martino, W Kramer, Z Kalbarczyk… - 2015 45th Annual IEEE …, 2015 - ieeexplore.ieee.org
This paper presents an in-depth characterization of the resiliency of more than 5 million HPC
application runs completed during the first 518 production days of Blue Waters, a 13.1 …

Using simulation to evaluate the performance of resilience strategies at scale

S Levy, B Topp, KB Ferreira, D Arnold, T Hoefler… - … and Simulation: 4th …, 2014 - Springer
Fault-tolerance has been identified as a major challenge for future extreme-scale systems.
Current predictions suggest that, as systems grow in size, failures will occur more frequently …

A programming model for resilience in extreme scale computing

S Hukerikar, PC Diniz, RF Lucas - IEEE/IFIP International …, 2012 - ieeexplore.ieee.org
System resilience is an important challenge that needs to be addressed in the era of
extreme scale computing. Exascale supercomputers will be architected using millions of …