Legio: fault resiliency for embarrassingly parallel MPI applications

R Rocco, D Gadioli, G Palermo - The Journal of Supercomputing, 2022 - Springer
Abstract
As HPC machines grow in size, faults become more frequent, and dealing with them is becoming mandatory. Natively, MPI cannot handle faults: it stops the execution prematurely when one occurs. The introduction of User-Level Failure Mitigation (ULFM) makes it possible to continue the execution, but it requires complex integration with the application. In this paper we propose Legio, a framework that introduces fault resiliency in embarrassingly parallel MPI applications. Legio exposes its features to the application transparently, removing any integration difficulty. After a fault, the execution continues only with the non-failed processes. We also propose a hierarchical alternative, which features lower repair costs on large communicators. We evaluated our solutions on the Marconi100 cluster at CINECA with benchmarks and real-world applications, showing that the overhead introduced by the library is negligible and that it does not limit the scalability properties of MPI.
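The "continue only with the non-failed processes" behavior can be illustrated with a toy simulation. The sketch below is not Legio's API and does not use MPI or ULFM; the function `run_with_shrink`, its parameters, and the round-robin partitioning are hypothetical names and choices made for illustration. It also assumes that the work assigned to a failed rank is simply dropped, which the abstract does not state.

```python
from typing import Callable, Dict, List, Set


def run_with_shrink(
    num_ranks: int,
    work_items: List[int],
    task: Callable[[int], int],
    failing_ranks: Set[int],
) -> Dict[str, object]:
    """Toy model: each rank processes its own slice; failed ranks are dropped."""
    # Static round-robin partition: rank r owns items r, r + num_ranks, ...
    # (an assumption; any independent partitioning would do).
    partitions = {r: work_items[r::num_ranks] for r in range(num_ranks)}

    # "Repair" step: keep only the ranks that did not fail. This is a
    # hypothetical stand-in for shrinking a communicator after a fault.
    survivors = [r for r in range(num_ranks) if r not in failing_ranks]

    # Survivors finish their own independent work.
    results = {r: [task(x) for x in partitions[r]] for r in survivors}

    # Assumption: items owned by failed ranks are lost, not redistributed.
    lost = sorted(
        x for r in failing_ranks if r in partitions for x in partitions[r]
    )

    # Final reduction runs over the surviving ranks only.
    total = sum(sum(vals) for vals in results.values())
    return {"survivors": survivors, "total": total, "lost_items": lost}


# Four ranks square the numbers 0..7; rank 2 fails.
outcome = run_with_shrink(4, list(range(8)), lambda x: x * x, failing_ranks={2})
print(outcome)
```

Because the tasks are independent, losing rank 2 only removes its own items (2 and 6) from the result; the other ranks need no coordination to finish, which is why this class of application suits the approach.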