Improving the computing efficiency of HPC systems using a combination of proactive and preventive...

[HTML] Toward exascale resilience: 2014 update

F Cappello, A Geist, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org

Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …

Cited by 424 · Related articles · All 14 versions

[PDF] academia.edu

Failure prediction for HPC systems and applications: Current situation and open issues

A Gainaru, F Cappello, M Snir… - … International journal of …, 2013 - journals.sagepub.com

As large-scale systems evolve towards post-petascale computing, it is crucial to focus on
providing fault-tolerance strategies that aim to minimize fault's effects on applications. By far …

Cited by 65 · Related articles · All 9 versions

[PDF] rutgers.edu

Exploring automatic, online failure recovery for scientific applications at extreme scales

M Gamell, DS Katz, H Kolla, J Chen… - SC'14: Proceedings …, 2014 - ieeexplore.ieee.org

Application resilience is a key challenge that must be addressed in order to realize the
exascale vision. Process/node failures, an important class of failures, are typically handled …

Cited by 127 · Related articles · All 8 versions

[PDF] psu.edu

Lazy checkpointing: Exploiting temporal locality in failures to mitigate checkpointing overheads on extreme-scale systems

D Tiwari, S Gupta, SS Vazhkudai - 2014 44th Annual IEEE/IFIP …, 2014 - ieeexplore.ieee.org

Continuing increase in the computational power of supercomputers has enabled large-scale
scientific applications in the areas of astrophysics, fusion, climate and combustion to run …

Cited by 118 · Related articles · All 9 versions

[PDF] osti.gov

Machine learning models for GPU error prediction in a large scale HPC system

B Nie, J Xue, S Gupta, T Patel… - 2018 48th Annual …, 2018 - ieeexplore.ieee.org

GPUs are widely deployed on large-scale HPC systems to provide powerful computational
capability for scientific applications from various domains. As those applications are …

Cited by 85 · Related articles · All 12 versions

[PDF] academia.edu

Optimization of multi-level checkpoint model for large scale HPC applications

S Di, MS Bouguerra, L Bautista-Gomez… - 2014 IEEE 28th …, 2014 - ieeexplore.ieee.org

HPC community projects that future extreme scale systems will be much less stable than
current Petascale systems, thus requiring sophisticated fault tolerance to guarantee the …

Cited by 114 · Related articles · All 10 versions

[PDF] unibo.it

Anomaly detection and anticipation in high performance computing systems

A Borghesi, M Molan, M Milano… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org

In their quest toward Exascale, High Performance Computing (HPC) systems are rapidly
becoming larger and more complex, together with the issues concerning their maintenance …

Cited by 29 · Related articles · All 6 versions

[PDF] umn.edu

Doomsday: Predicting which node will fail when on supercomputers

A Das, F Mueller, P Hargrove… - … Conference for High …, 2018 - ieeexplore.ieee.org

Predicting which node will fail and how soon remains a challenge for HPC resilience, yet
may pave the way to exploiting proactive remedies before jobs fail. Not only for increasing …

Cited by 52 · Related articles · All 11 versions

[PDF] academia.edu

Digging deeper into cluster system logs for failure prediction and root cause diagnosis

X Fu, R Ren, SA McKee, J Zhan… - 2014 IEEE International …, 2014 - ieeexplore.ieee.org

As the sizes of supercomputers and data centers grow towards exascale, failures become
normal. System logs play a critical role in the increasingly complex tasks of automatic failure …

Cited by 56 · Related articles · All 4 versions

[PDF] arxiv.org

Toward reliable and rapid elasticity for streaming dataflows on clouds

A Shukla, Y Simmhan - 2018 IEEE 38th International …, 2018 - ieeexplore.ieee.org

The pervasive availability of streaming data is driving Fast Data platforms for low-latency
streaming applications. Such applications need to respond to dynamism in the input rates …

Cited by 28 · Related articles · All 6 versions
