System resilience at extreme scale

Toward exascale resilience: 2014 update

F Cappello, A Geist, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org

Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …

Cited by 434 · 14 versions

Addressing failures in exascale computing

M Snir, RW Wisniewski, JA Abraham… - … Journal of High …, 2014 - journals.sagepub.com

We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …

Cited by 539 · 20 versions

×pipes Lite: a synthesis oriented design library for networks on chips

S Stergiou, F Angiolini, S Carta, L Raffo… - … Automation and Test …, 2005 - ieeexplore.ieee.org

The limited scalability of current bus topologies for systems on chips (SoCs) dictates the
adoption of networks on chips (NoCs) as a scalable interconnection scheme. Current SoCs …

Cited by 192 · 15 versions

The reliability wall for exascale supercomputing

X Yang, Z Wang, J Xue, Y Zhou - IEEE Transactions on …, 2011 - ieeexplore.ieee.org

Reliability is a key challenge to be understood to turn the vision of exascale supercomputing
into reality. Inevitably, large-scale supercomputing systems, especially those at the …

Cited by 111 · 9 versions

Doomsday: Predicting which node will fail when on supercomputers

A Das, F Mueller, P Hargrove… - … Conference for High …, 2018 - ieeexplore.ieee.org

Predicting which node will fail and how soon remains a challenge for HPC resilience, yet
may pave the way to exploiting proactive remedies before jobs fail. Not only for increasing …

Cited by 55 · 11 versions

Making a case for distributed file systems at exascale

I Raicu, IT Foster, P Beckman - … of the third international workshop on …, 2011 - dl.acm.org

Exascale computers will enable the unraveling of significant scientific mysteries. Predictions
are that 2019 will be the year of exascale, with millions of compute nodes and billions of …

Cited by 108

Resilience design patterns: A structured approach to resilience at extreme scale

S Hukerikar, C Engelmann - arXiv preprint arXiv:1708.07422, 2017 - arxiv.org

Reliability is a serious concern for future extreme-scale high-performance computing (HPC)
systems. While the HPC community has developed various resilience solutions, the solution …

Cited by 53 · 21 versions

Redundant execution of HPC applications with MR-MPI

C Engelmann, S Böhm - Proceedings of the 10th IASTED …, 2011 - christian-engelmann.info

This paper presents a modular-redundant Message Passing Interface (MPI) solution, MR-
MPI, for transparently executing high-performance computing (HPC) applications in a …

Cited by 82 · 16 versions

Efficient synchronization under global EDF scheduling on multiprocessors

UMC Devi, H Leontyev… - … Euromicro Conference on …, 2006 - ieeexplore.ieee.org

We consider coordinating accesses to shared data structures in multiprocessor real-time
systems scheduled under preemptive global EDF. To our knowledge, prior work on global …

Cited by 99 · 13 versions

HydEE: Failure containment without event logging for large scale send-deterministic MPI applications

A Guermouche, T Ropars, M Snir… - 2012 IEEE 26th …, 2012 - ieeexplore.ieee.org

High performance computing will probably reach exascale in this decade. At this scale,
mean time between failures is expected to be a few hours. Existing fault tolerant protocols for …

Cited by 75 · 17 versions
