Towards distributed software resilience in asynchronous many-task programming models

N Gupta, JR Mayo, AS Lemoine… - 2020 IEEE/ACM 10th …, 2020 - ieeexplore.ieee.org
Exceptions and errors occurring within mission critical applications due to hardware failures
have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware …

Enabling resilience in asynchronous many-task programming models

SR Paul, A Hayashi, N Slattengren, H Kolla… - Euro-Par 2019: Parallel …, 2019 - Springer
Resilience is an imminent issue for next-generation platforms due to projected increases in
soft/transient failures as part of the inherent trade-offs among performance, energy, and …

Implementing software resiliency in HPX for extreme scale computing

N Gupta, JR Mayo, AS Lemoine, H Kaiser - arXiv preprint arXiv …, 2020 - arxiv.org
Exceptions and errors occurring within mission critical applications due to hardware failures
have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware …

Integrating inter-node communication with a resilient asynchronous many-task runtime system

SR Paul, A Hayashi, M Whitlock, S Bak… - 2020 Workshop on …, 2020 - ieeexplore.ieee.org
Achieving fault tolerance is one of the significant challenges of exascale computing due to
projected increases in soft/transient failures. While past work on software-based resilience …

Load Balancing, Fault Tolerance, and Resource Elasticity for Asynchronous Many-Task Systems

J Posner - 2021 - kobra.uni-kassel.de
High-Performance Computing (HPC) enables solving complex problems from
various scientific fields including key societal problems such as COVID-19. Recently …

Asymmetric resilience: Exploiting task-level idempotency for transient error recovery in accelerator-based systems

J Leng, A Buyuktosunoglu, R Bertran… - … Symposium on High …, 2020 - ieeexplore.ieee.org
Accelerators make the task of building systems that are resilient against transient errors like
voltage noise and soft errors hard. Architects integrate accelerators into the system as black …

Towards high performance resilience using performance portable abstractions

N Morales, K Teranishi, B Nicolae, C Trott… - Euro-Par 2021: Parallel …, 2021 - Springer
In the drive towards Exascale, the extreme heterogeneity of supercomputers at all levels
places a major development burden on HPC applications. To this end, performance portable …

Leveraging a task-based asynchronous dataflow substrate for efficient and scalable resiliency

O Subasi, J Arias, J Labarta, O Unsal… - Workshop on …, 2014 - dpss.inesc-id.pt

Supervised workpools for reliable massively parallel computing

R Stewart, P Trinder, P Maier - … Symposium, TFP 2012, St. Andrews, UK …, 2013 - Springer
The manycore revolution is steadily increasing the performance and size of massively
parallel systems, to the point where system reliability becomes a pressing concern …

Extreme-scale viability of collective communication for resilient task scheduling and work stealing

J Wilke, J Bennett, H Kolla, K Teranishi… - 2014 44th Annual …, 2014 - ieeexplore.ieee.org
Extreme-scale computing will bring significant changes to high performance computing
system architectures. In particular, the increased number of system components is creating a …