Related articles - Academic Resource Search

Predictive reliability and fault management in exascale systems: State of the art and perspectives

R Canal, C Hernandez, R Tornero, A Cilardo… - ACM Computing …, 2020 - dl.acm.org

Performance and power constraints come together with Complementary Metal Oxide
Semiconductor technology scaling in future Exascale systems. Technology scaling makes …

Cited by 33 - Related articles - All 12 versions

An analysis of resilience techniques for exascale computing platforms

D Dauwe, S Pasricha, AA Maciejewski… - 2017 IEEE …, 2017 - ieeexplore.ieee.org

With the increase in the complexity and number of nodes in large-scale high performance
computing (HPC) systems, the probability of applications experiencing failures has …

Cited by 19 - Related articles - All 7 versions

Predicting faults in high performance computing systems: An in-depth survey of the state-of-the-practice

D Jauk, D Yang, M Schulz - … of the International Conference for High …, 2019 - dl.acm.org

As we near exascale, resilience remains a major technical hurdle. Any technique with the
goal of achieving resilience suffers from having to be reactive, as failures can appear at any …

Cited by 39 - Related articles

Failures in large scale systems: long-term measurement, analysis, and implications

S Gupta, T Patel, C Engelmann, D Tiwari - Proceedings of the …, 2017 - dl.acm.org

Resilience is one of the key challenges in maintaining high efficiency of future extreme scale
supercomputers. Researchers and system practitioners rely on field-data studies to …

Cited by 174 - Related articles - All 12 versions

Addressing failures in exascale computing

M Snir, RW Wisniewski, JA Abraham… - … Journal of High …, 2014 - journals.sagepub.com

We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …

Cited by 526 - Related articles - All 20 versions

Lessons learned from memory errors observed over the lifetime of Cielo

S Levy, KB Ferreira, N DeBardeleben… - … Conference for High …, 2018 - ieeexplore.ieee.org

Maintaining the performance of high-performance computing (HPC) applications as failures
increase is a major challenge for next-generation extreme-scale systems. Recent work …

Cited by 43 - Related articles - All 6 versions

Reducing waste in extreme scale systems through introspective analysis

L Bautista-Gomez, A Gainaru… - 2016 IEEE …, 2016 - ieeexplore.ieee.org

Resilience is an important challenge for extreme-scale supercomputers. Today, failures in
supercomputers are assumed to be uniformly distributed in time. However, recent studies …

Cited by 48 - Related articles - All 14 versions

HPC hardware design reliability benchmarking with HDFIT

P Omland, A Netti, Y Peng, A Baldovin… - … on Parallel and …, 2023 - ieeexplore.ieee.org

Chips pack ever more, ever smaller transistors. Fault rates increase in turn and become
more concerning, particularly at the scale of High-Performance Computing (HPC) systems …

Cited by 5 - Related articles - All 2 versions

Codesign challenges for exascale systems: Performance, power, and reliability

D Kerbyson, A Vishnu, K Barker, A Hoisie - Computer, 2011 - ieeexplore.ieee.org

The complexity of large-scale parallel systems necessitates the simultaneous optimization of
multiple hardware and software components to meet performance, efficiency, and fault …

Cited by 25 - Related articles - All 6 versions

Reading between the lines of failure logs: Understanding how HPC systems fail

N El-Sayed, B Schroeder - 2013 43rd annual IEEE/IFIP …, 2013 - ieeexplore.ieee.org

As the component count in supercomputing installations continues to increase, system
reliability is becoming one of the major issues in designing HPC systems. These issues will …

Cited by 133 - Related articles - All 8 versions
