Enabling resilience in asynchronous many-task programming models

SR Paul, A Hayashi, N Slattengren, H Kolla… - Euro-Par 2019: Parallel …, 2019 - Springer
Resilience is an imminent issue for next-generation platforms due to projected increases in
soft/transient failures as part of the inherent trade-offs among performance, energy, and …

Towards distributed software resilience in asynchronous many-task programming models

N Gupta, JR Mayo, AS Lemoine… - 2020 IEEE/ACM 10th …, 2020 - ieeexplore.ieee.org
Exceptions and errors occurring within mission critical applications due to hardware failures
have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware …

Implementing software resiliency in HPX for extreme scale computing

N Gupta, JR Mayo, AS Lemoine, H Kaiser - arXiv preprint arXiv …, 2020 - arxiv.org
Exceptions and errors occurring within mission critical applications due to hardware failures
have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware …

Leveraging a task-based asynchronous dataflow substrate for efficient and scalable resiliency

O Subasi, J Arias, J Labarta, O Unsal… - Workshop on …, 2014 - dpss.inesc-id.pt

A cross-layer multicore architecture to tradeoff program accuracy and resilience overheads

Q Shi, H Hoffmann, O Khan - IEEE Computer Architecture …, 2014 - ieeexplore.ieee.org
To protect multicores from soft-error perturbations, resiliency schemes have been developed
with high coverage but high power/performance overheads (~2x). We observe that not all …

Declarative resilience: A holistic soft-error resilient multicore architecture that trades off program accuracy for efficiency

H Omar, Q Shi, M Ahmad, H Dogan… - ACM Transactions on …, 2018 - dl.acm.org
To protect multicores from soft-error perturbations, research has explored various resiliency
schemes that provide high soft-error coverage. However, these schemes incur high …

Evaluating the resilience of parallel applications

M Wilkening, F Previlon, DR Kaeli… - … on Defect and Fault …, 2018 - ieeexplore.ieee.org
Reliability is a significant design constraint for supercomputers and large-scale data centers.
Modeling the effects of faults on applications targeted to such systems allows system …

Asymmetric resilience: Exploiting task-level idempotency for transient error recovery in accelerator-based systems

J Leng, A Buyuktosunoglu, R Bertran… - … Symposium on High …, 2020 - ieeexplore.ieee.org
Accelerators make the task of building systems that are resilient against transient errors like
voltage noise and soft errors hard. Architects integrate accelerators into the system as black …

ABFR: convenient management of latent error resilience using application knowledge

A Fang, AA Chien - Proceedings of the 27th international symposium on …, 2018 - dl.acm.org
Exascale systems face high error-rates due to increasing scale (10^9 cores), software
complexity and rising memory error rates. Increasingly, errors escape immediate hardware …

Programmer-directed partial redundancy for resilient HPC

O Subasi, J Arias, O Unsal, J Labarta… - Proceedings of the 12th …, 2015 - dl.acm.org
In this work we propose partial task replication and checkpointing for task-parallel HPC
applications to mitigate silent data corruption (SDC) errors. As the complete replication of all …