LetGo: A lightweight continuous framework for HPC applications under failures

B Fang, Q Guan, N Debardeleben… - Proceedings of the 26th …, 2017 - dl.acm.org
Proceedings of the 26th International Symposium on High-Performance Parallel …, 2017 (dl.acm.org)
Requirements for reliability, low power consumption, and performance place complex and conflicting demands on the design of high-performance computing (HPC) systems. Fault-tolerance techniques such as checkpoint/restart (C/R) protect HPC applications against hardware faults. These techniques, however, have non-negligible overheads, particularly when the fault rate exposed by the hardware is high: it is estimated that in future HPC systems, up to 60% of the computational cycles/power will be used for fault tolerance.
To mitigate the overall overhead of fault-tolerance techniques, we propose LetGo, an approach that attempts to continue the execution of an HPC application when crashes would otherwise occur. Our hypothesis is that a class of HPC applications has good enough intrinsic fault tolerance that it is possible to repurpose the default mechanism that terminates an application once a crash-causing error is signalled, and instead attempt to repair the corrupted application state and continue the application's execution. This paper explores this hypothesis and quantifies the impact of using this observation in the context of checkpoint/restart (C/R) mechanisms.
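The hypothesis above can be illustrated with a toy analogue. The sketch below is not LetGo's implementation: the abstract does not describe how LetGo intercepts crashes or repairs state, so everything here is an assumption made for illustration. A Python `IndexError` stands in for a crash-causing error, the hypothetical `let_go` flag stands in for replacing the default "terminate" behaviour, and the "repair" is simply to skip the faulting update. A Jacobi solver is used because iterative methods of this kind tend to absorb a single bad step.

```python
def jacobi(A, b, iters, fault_at=None, let_go=False):
    """Jacobi iteration for A x = b with an optional injected fault.

    fault_at: iteration at which a hypothetical out-of-range access occurs.
    let_go:   if True, skip the faulting update and continue instead of
              terminating (a toy stand-in for crash elision).
    """
    n = len(b)
    x = [0.0] * n
    for it in range(iters):
        new_x = list(x)
        for i in range(n):
            try:
                if it == fault_at and i == 0:
                    # Injected fault: a corrupted index, standing in for a
                    # crash-causing memory access.
                    _ = x[n + 5]
                s = sum(A[i][j] * x[j] for j in range(n) if j != i)
                new_x[i] = (b[i] - s) / A[i][i]
            except IndexError:
                if not let_go:
                    raise  # default behaviour: the run terminates
                # Toy "repair": keep the previous value of x[i] and move on.
        x = new_x
    return x


# Small diagonally dominant system, so Jacobi converges.
A = [[4.0, 1.0], [2.0, 5.0]]
b = [9.0, 12.0]

clean = jacobi(A, b, 60)
continued = jacobi(A, b, 60, fault_at=3, let_go=True)
print(clean, continued)
```

In this toy setting the continued run converges to the same solution as the fault-free run, because later iterations correct the skipped update. Whether a real application behaves this way depends on the application and on where the fault lands, which is what the paper's fault-injection experiments measure.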
Our fault-injection experiments using a suite of five HPC applications show that, on average, LetGo is able to elide 62% of the crashes encountered by applications, of which 80% result in correct output, while incurring a negligible performance overhead. As a result, when LetGo is used in conjunction with a C/R scheme, it enables significantly higher efficiency, leading to faster time to solution.
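As a quick check on how the two reported percentages combine, the sketch below splits all would-be crashes into three outcomes. It assumes that "of which 80%" refers to the 62% of crashes that LetGo continues past; the function name and outcome labels are hypothetical and not taken from the paper.

```python
from fractions import Fraction


def crash_outcomes(continued_pct, correct_pct):
    """Split would-be crashes into outcome shares (exact fractions).

    continued_pct: percentage of crashes that are elided (run continues).
    correct_pct:   percentage of those continued runs with correct output.
    """
    continued = Fraction(continued_pct, 100)
    correct_given_continued = Fraction(correct_pct, 100)
    return {
        "continued, correct output": continued * correct_given_continued,
        "continued, other outcome": continued * (1 - correct_given_continued),
        "still crashes": 1 - continued,
    }


# Averages reported in the abstract: 62% elided, 80% of those correct.
outcomes = crash_outcomes(62, 80)
for name, share in outcomes.items():
    print(f"{name}: {float(share):.1%}")
```

Under that reading, roughly half of all would-be crashes (62% × 80% = 49.6%) end in a run that completes with correct output, and 38% still terminate as crashes.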