Authors
Tatiana Martsinkevich, Omer Subasi, Osman Unsal, Franck Cappello, Jesus Labarta
Publication date
2015/9/8
Conference
2015 IEEE International Conference on Cluster Computing
Pages
563-570
Publisher
IEEE
Description
We present a fault-tolerant protocol for task-parallel message-passing applications to mitigate transient errors. The protocol requires the restart only of the task that experienced the error and transparently handles any MPI calls inside the task. The protocol is implemented in Nanos -- a dataflow runtime for the task-based OmpSs programming model -- and the PMPI profiling layer to fully support hybrid OmpSs+MPI applications. In our experiments, we demonstrate that our fault-tolerant solution has a reasonable overhead, with a maximum observed overhead of 4.5%. We also show that fine-grained parallelization is important for hiding the overheads related to the protocol as well as the recovery of tasks.
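The abstract does not spell out how MPI calls inside a restarted task are handled. As a rough illustration only, the sketch below assumes one common approach: log messages received during the first attempt and replay them on re-execution, while suppressing sends that already went out. `TaskComm`, `TransientError`, `run_task`, and `demo` are hypothetical names, and plain Python containers stand in for MPI; this is not the Nanos or PMPI implementation described in the paper.

```python
from collections import deque


class TransientError(Exception):
    """Hypothetical signal that a transient error hit the running task."""


class TaskComm:
    """Hypothetical stand-in for intercepted message-passing calls in one task."""

    def __init__(self, inbox, outbox):
        self.inbox = inbox      # deque simulating messages arriving from peers
        self.outbox = outbox    # list simulating messages delivered to peers
        self.recv_log = []      # messages already received by this task
        self.sends_done = 0     # number of sends already delivered
        self._recv_pos = 0
        self._send_pos = 0

    def begin_attempt(self):
        # Each (re-)execution starts counting its calls from zero.
        self._recv_pos = 0
        self._send_pos = 0

    def recv(self):
        if self._recv_pos < len(self.recv_log):
            msg = self.recv_log[self._recv_pos]   # replay from the log
        else:
            msg = self.inbox.popleft()            # first-time receive
            self.recv_log.append(msg)
        self._recv_pos += 1
        return msg

    def send(self, msg):
        if self._send_pos >= self.sends_done:     # not yet delivered
            self.outbox.append(msg)
            self.sends_done += 1
        self._send_pos += 1                       # otherwise suppress duplicate


def run_task(task, inputs, comm, max_retries=3):
    """Re-execute only the failed task, from a private copy of its inputs."""
    for _ in range(max_retries + 1):
        comm.begin_attempt()
        try:
            return task(list(inputs), comm)
        except TransientError:
            continue
    raise RuntimeError("task failed after all retries")


def demo():
    inbox, outbox = deque([10]), []
    comm = TaskComm(inbox, outbox)
    failures = {"left": 1}  # inject exactly one transient error

    def task(data, comm):
        comm.send(sum(data))
        halo = comm.recv()
        if failures["left"] > 0:
            failures["left"] -= 1
            raise TransientError()
        return sum(data) + halo

    return run_task(task, [1, 2, 3], comm), outbox, list(inbox)
```

In this toy run the task fails once after communicating; the retry sees the same received value from the log and the peer receives the sent value only once, so the restart stays local to the task.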
Total citations
[Chart: citations per year, 2014-2021]
Scholar articles
T Martsinkevich, O Subasi, O Unsal, F Cappello… - 2015 IEEE International Conference on Cluster …, 2015