Designing and modelling selective replication for fault-tolerant HPC applications

O Subasi, G Yalcin, F Zyulkyarov… - 2017 17th IEEE/ACM …, 2017 - ieeexplore.ieee.org
2017 17th IEEE/ACM International Symposium on Cluster, Cloud and …, 2017 - ieeexplore.ieee.org
Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. Some studies address fail-stop errors and others address SDCs. However, few studies address both types of errors together. In this paper, we propose a software-based selective replication technique for HPC applications that handles both fail-stop errors and SDCs. Since complete replication of applications can be costly in terms of resources, we develop a runtime-based technique for selective replication. Selective replication provides an opportunity to meet HPC reliability targets while decreasing resource costs. Our technique is low-overhead, automatic, and completely transparent to the user.