Integrating inter-node communication with a resilient asynchronous many-task runtime system

SR Paul, A Hayashi, M Whitlock, S Bak… - 2020 Workshop on …, 2020 - ieeexplore.ieee.org
Achieving fault tolerance is one of the significant challenges of exascale computing due to
projected increases in soft/transient failures. While past work on software-based resilience …

Enabling resilience in asynchronous many-task programming models

SR Paul, A Hayashi, N Slattengren, H Kolla… - Euro-Par 2019: Parallel …, 2019 - Springer
Resilience is an imminent issue for next-generation platforms due to projected increases in
soft/transient failures as part of the inherent trade-offs among performance, energy, and …

Towards distributed software resilience in asynchronous many-task programming models

N Gupta, JR Mayo, AS Lemoine… - 2020 IEEE/ACM 10th …, 2020 - ieeexplore.ieee.org
Exceptions and errors occurring within mission critical applications due to hardware failures
have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware …

Implementing software resiliency in HPX for extreme scale computing

N Gupta, JR Mayo, AS Lemoine, H Kaiser - arXiv preprint arXiv …, 2020 - arxiv.org
Exceptions and errors occurring within mission critical applications due to hardware failures
have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware …

Asymmetric resilience: Exploiting task-level idempotency for transient error recovery in accelerator-based systems

J Leng, A Buyuktosunoglu, R Bertran… - … Symposium on High …, 2020 - ieeexplore.ieee.org
Accelerators make the task of building systems that are resilient against transient errors like
voltage noise and soft errors hard. Architects integrate accelerators into the system as black …

Load Balancing, Fault Tolerance, and Resource Elasticity for Asynchronous Many-Task Systems

J Posner - 2021 - kobra.uni-kassel.de
High-Performance Computing (HPC) enables solving complex problems from
various scientific fields including key societal problems such as COVID-19. Recently …

Scalable and highly available fault resilient programming middleware for exascale computing

A Takefusa, T Ikegami, H Nakada… - … of IEEE/ACM …, 2014 - sc14.supercomputing.org
A hierarchical master-worker model is believed to be a promising programming paradigm
that can achieve weak scaling on exascale-level high performance computers [1] …

Declarative resilience: A holistic soft-error resilient multicore architecture that trades off program accuracy for efficiency

H Omar, Q Shi, M Ahmad, H Dogan… - ACM Transactions on …, 2018 - dl.acm.org
To protect multicores from soft-error perturbations, research has explored various resiliency
schemes that provide high soft-error coverage. However, these schemes incur high …

Chaser: An enhanced fault injection tool for tracing soft errors in MPI applications

Q Guan, X Hu, T Grove, B Fang, H Jiang… - 2020 50th Annual …, 2020 - ieeexplore.ieee.org
Resilient computation has been an emerging topic in the field of high-performance
computing (HPC). In particular, studies show that tolerating faults on leadership-class …

ABFR: convenient management of latent error resilience using application knowledge

A Fang, AA Chien - Proceedings of the 27th international symposium on …, 2018 - dl.acm.org
Exascale systems face high error-rates due to increasing scale (10^9 cores), software
complexity and rising memory error rates. Increasingly, errors escape immediate hardware …