Related articles - Academic resource search

Partial redundancy in HPC systems with non-uniform node reliabilities

Z Hussain, T Znati, R Melhem - SC18: International Conference …, 2018 - ieeexplore.ieee.org

We study the usefulness of partial redundancy in HPC message passing systems where
individual node failure distributions are not identical. Prior research works on fault tolerance …

Cited by 23 - Related articles - All 7 versions

[PDF] pitt.edu

Color: Co-located rescuers for fault tolerance in HPC systems

Z Hussain, X Cui, T Znati… - 2018 IEEE 24th …, 2018 - ieeexplore.ieee.org

With the increase in scale of HPC systems, the frequency of system wide failures is expected
to increase. The performance of Coordinated Checkpoint/Restart (C/R), the traditional fault …

Cited by 6 - Related articles - All 7 versions

[PDF] ncsu.edu

Combining partial redundancy and checkpointing for HPC

J Elliott, K Kharbas, D Fiala, F Mueller… - 2012 IEEE 32nd …, 2012 - ieeexplore.ieee.org

Today's largest High Performance Computing (HPC) systems exceed one Petaflops (10^15
floating point operations per second) and exascale systems are projected within seven …

Cited by 205 - Related articles - All 20 versions

[PDF] hal.science

Assuming failure independence: are we right to be wrong?

G Aupy, Y Robert, F Vivien - 2017 IEEE International …, 2017 - ieeexplore.ieee.org

This paper revisits the failure temporal independence hypothesis which is omnipresent in
the analysis of resilience methods for HPC. We explain why a previous approach is …

Cited by 23 - Related articles - All 11 versions

[PDF] researchgate.net

Optimizing HPC fault-tolerant environment: An analytical approach

H Jin, Y Chen, H Zhu, XH Sun - 2010 39th International …, 2010 - ieeexplore.ieee.org

The increasingly large ensemble size of modern High-Performance Computing (HPC)
systems has drastically increased the possibility of failures. Performance under failures and …

Cited by 63 - Related articles - All 13 versions

[PDF] osti.gov

See applications run and throughput jump: The case for redundant computing in HPC

R Riesen, K Ferreira, J Stearley - … International Conference on …, 2010 - ieeexplore.ieee.org

For future parallel-computing systems with as few as twenty-thousand nodes we propose
redundant computing to reduce the number of application interrupts. The frequency of faults …

Cited by 29 - Related articles - All 8 versions

[PDF] psu.edu

[PDF] Toward efficient failure detection and recovery in HPC

S Rani, C Leangsuksun, A Tikotekar… - High Availability and …, 2006 - Citeseer

Application outages due to node failures are common problems in high performance
computing. Reliability becomes a major issue, especially for long running jobs, as the …

Cited by 12 - Related articles - All 3 versions

[PDF] illinois.edu

Assessing energy efficiency of fault tolerance protocols for HPC systems

E Meneses, O Sarood, LV Kalé - 2012 IEEE 24th International …, 2012 - ieeexplore.ieee.org

An exascale machine is expected to be delivered in the time frame 2018-2020. Such a
machine will be able to tackle some of the hardest computational problems and to extend …

Cited by 62 - Related articles - All 12 versions

[PDF] osti.gov

Dino: Divergent node cloning for sustained redundancy in HPC

A Rezaei, F Mueller, P Hargrove, E Roman - Journal of Parallel and …, 2017 - Elsevier

Complexity and scale of next generation HPC systems pose significant challenges in fault
resilience methods such that contemporary checkpoint/restart (C/R) methods that address …

Cited by 12 - Related articles - All 17 versions

[PDF] psu.edu

[PDF] Reliability Analysis in HPC clusters

N Raju, YL Gottumukkala, CB Leangsuksun… - Proceedings of the High …, 2006 - Citeseer

Resource failures and down times have become a growing concern for large-scale
computational platforms, as they tend to have an adverse effect on the performance of the …

Cited by 30 - Related articles - All 2 versions
