Related articles - Academic resource search

LetGo: A lightweight continuous framework for HPC applications under failures

B Fang, Q Guan, N Debardeleben… - Proceedings of the 26th …, 2017 - dl.acm.org

Requirements for reliability, low power consumption, and performance place complex and
conflicting demands on the design of high-performance computing (HPC) systems. Fault …

Cited by 28 - Related articles - All 4 versions

[PDF] illinois.edu

Assessing energy efficiency of fault tolerance protocols for HPC systems

E Meneses, O Sarood, LV Kalé - 2012 IEEE 24th International …, 2012 - ieeexplore.ieee.org

An exascale machine is expected to be delivered in the time frame 2018-2020. Such a
machine will be able to tackle some of the hardest computational problems and to extend …

Cited by 62 - Related articles - All 12 versions

[PDF] pitt.edu

Color: Co-located rescuers for fault tolerance in HPC systems

Z Hussain, X Cui, T Znati… - 2018 IEEE 24th …, 2018 - ieeexplore.ieee.org

With the increase in scale of HPC systems, the frequency of system wide failures is expected
to increase. The performance of Coordinated Checkpoint/Restart (C/R), the traditional fault …

Cited by 6 - Related articles - All 7 versions

[PDF] epfl.ch

SPBC: Leveraging the characteristics of MPI HPC applications for scalable checkpointing

T Ropars, TV Martsinkevich, A Guermouche… - Proceedings of the …, 2013 - dl.acm.org

The high failure rate expected for future supercomputers requires the design of new fault
tolerant solutions. Most checkpointing protocols are designed to work with any message …

Cited by 52 - Related articles - All 14 versions

Asynchronous checkpointing by dedicated checkpoint threads

F Shahzad, M Wittmann, T Zeiser, G Wellein - Recent Advances in the …, 2012 - Springer

Checkpoint/restart (C/R) is a classical approach to introduce fault tolerance in large
HPC applications. Although it is relatively easy as compared to other fault tolerance …

Cited by 14 - Related articles - All 5 versions

[PDF] hal.science

Multi-criteria checkpointing strategies: Response-time versus resource utilization

A Bouteiller, F Cappello, J Dongarra… - Euro-Par 2013 Parallel …, 2013 - Springer

Failures are increasingly threatening the efficiency of HPC systems, and current projections
of Exascale platforms indicate that rollback recovery, the most convenient method for …

Cited by 19 - Related articles - All 15 versions

[PDF] toronto.edu

Checkpoint/restart in practice: When 'simple is better'

N El-Sayed, B Schroeder - 2014 IEEE International Conference …, 2014 - ieeexplore.ieee.org

Efficient use of high-performance computing (HPC) installations critically relies on effective
methods for fault tolerance. The most commonly used method is checkpoint/restart, where …

Cited by 27 - Related articles - All 5 versions

[PDF] arxiv.org

CRAFT: A library for easier application-level checkpoint/restart and automatic fault tolerance

F Shahzad, J Thies, M Kreutzer, T Zeiser… - … on Parallel and …, 2018 - ieeexplore.ieee.org

In order to efficiently use the future generations of supercomputers, fault tolerance and
power consumption are two of the prime challenges anticipated by the High Performance …

Cited by 70 - Related articles - All 10 versions

[HTML] springer.com

[HTML] A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

IP Egwutuoha, D Levy, B Selic, S Chen - The Journal of Supercomputing, 2013 - Springer

In recent years, High Performance Computing (HPC) systems have been shifting
from expensive massively parallel architectures to clusters of commodity PCs to take …

Cited by 343 - Related articles - All 12 versions

[PDF] hal.science

Assuming failure independence: are we right to be wrong?

G Aupy, Y Robert, F Vivien - 2017 IEEE International …, 2017 - ieeexplore.ieee.org

This paper revisits the failure temporal independence hypothesis which is omnipresent in
the analysis of resilience methods for HPC. We explain why a previous approach is …

Cited by 23 - Related articles - All 11 versions
