Toward exascale resilience: 2014 update

F Cappello, G Al, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …

Addressing failures in exascale computing

M Snir, RW Wisniewski, JA Abraham… - … Journal of High …, 2014 - journals.sagepub.com
We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …

×pipes Lite: a synthesis oriented design library for networks on chips

S Stergiou, F Angiolini, S Carta, L Raffo… - … Automation and Test …, 2005 - ieeexplore.ieee.org
The limited scalability of current bus topologies for systems on chips (SoCs) dictates the
adoption of networks on chips (NoCs) as a scalable interconnection scheme. Current SoCs …

The reliability wall for exascale supercomputing

X Yang, Z Wang, J Xue, Y Zhou - IEEE Transactions on …, 2011 - ieeexplore.ieee.org
Reliability is a key challenge to be understood to turn the vision of exascale supercomputing
into reality. Inevitably, large-scale supercomputing systems, especially those at the …

Doomsday: Predicting which node will fail when on supercomputers

A Das, F Mueller, P Hargrove… - … Conference for High …, 2018 - ieeexplore.ieee.org
Predicting which node will fail and how soon remains a challenge for HPC resilience, yet
may pave the way to exploiting proactive remedies before jobs fail. Not only for increasing …

Making a case for distributed file systems at exascale

I Raicu, IT Foster, P Beckman - … of the third international workshop on …, 2011 - dl.acm.org
Exascale computers will enable the unraveling of significant scientific mysteries. Predictions
are that 2019 will be the year of exascale, with millions of compute nodes and billions of …

Resilience design patterns: A structured approach to resilience at extreme scale

S Hukerikar, C Engelmann - arXiv preprint arXiv:1708.07422, 2017 - arxiv.org
Reliability is a serious concern for future extreme-scale high-performance computing (HPC)
systems. While the HPC community has developed various resilience solutions, the solution …

Redundant execution of HPC applications with MR-MPI

C Engelmann, S Böhm - Proceedings of the 10th IASTED …, 2011 - christian-engelmann.info
This paper presents a modular-redundant Message Passing Interface (MPI) solution, MR-MPI,
for transparently executing high-performance computing (HPC) applications in a …

Efficient synchronization under global EDF scheduling on multiprocessors

UMC Devi, H Leontyev… - … Euromicro Conference on …, 2006 - ieeexplore.ieee.org
We consider coordinating accesses to shared data structures in multiprocessor real-time
systems scheduled under preemptive global EDF. To our knowledge, prior work on global …

Hydee: Failure containment without event logging for large scale send-deterministic MPI applications

A Guermouche, T Ropars, M Snir… - 2012 IEEE 26th …, 2012 - ieeexplore.ieee.org
High performance computing will probably reach exascale in this decade. At this scale,
mean time between failures is expected to be a few hours. Existing fault tolerant protocols for …