An ABFT scheme based on communication characteristics

U Kabir, D Goswami - 2016 IEEE International Conference on …, 2016 - ieeexplore.ieee.org
U Kabir, D Goswami
2016 IEEE International Conference on Cluster Computing (CLUSTER), 2016
The paper presents an Algorithm-Based Fault Tolerance (ABFT) scheme that can be applied to a class of parallel applications with similar algorithmic and communication characteristics, rather than being tied to one specific application as ABFT schemes usually are. The ideas are elaborated in the context of the parallel dynamic programming class of applications, but are not limited to dynamic programming. The communication characteristics of an application determine how to distributively save the fault recovery data of a process (we call it the critical data) so as to minimize any extra message overhead, and the algorithmic characteristics of an application determine what data is to be saved in order to minimize fault tolerance and recovery cost. A fault tolerance protocol is presented for the class of applications with similar algorithmic and communication characteristics. As a case study, a specific approach to fault tolerance for the parallel dynamic programming class of applications is investigated. Experimental results demonstrate low fault tolerance overhead relative to a non-fault-tolerant application in a failure-free execution, and low recovery overhead in the case of single and multiple process failures. Moreover, a comparison with diskless checkpointing shows promising results.
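The abstract does not spell out the protocol, so the following is only a minimal sketch of the general idea under assumed details, not the authors' scheme. It simulates a column-partitioned longest-common-subsequence computation in which each worker already sends one boundary value per row to its right neighbor; the sketch assumes the worker's critical data is its last completed row segment and piggybacks a copy on that same message, so no extra message is sent during failure-free execution. The names `Worker`, `run_lcs`, and `lcs_reference` are hypothetical, the "processes" are simulated sequentially, and the sketch assumes that the last worker and two adjacent workers never fail in the same row.

```python
def lcs_reference(a, b):
    """Plain sequential LCS length, used only to check the simulation."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == bj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


class Worker:
    """Simulated process owning columns lo..hi-1 of string b (hypothetical)."""

    def __init__(self, lo, hi, pred_width):
        self.lo, self.hi = lo, hi
        self.row = [0] * (hi - lo)  # assumed critical data: last completed row segment
        # Copy of the left neighbor's critical data, received via piggybacking.
        self.backup = [0] * pred_width if pred_width else None

    def step(self, ch, b, left_cur, left_prev):
        """Compute the next DP row segment; return the boundary value to send on."""
        new, diag, left = [], left_prev, left_cur
        for k, j in enumerate(range(self.lo, self.hi)):
            up = self.row[k]
            val = diag + 1 if ch == b[j] else max(up, left)
            new.append(val)
            diag, left = up, val
        self.row = new
        return new[-1]


def run_lcs(a, b, num_workers, failures=frozenset()):
    """failures: set of (row_index, worker_id); that worker loses all state
    after the given row. Assumes 1 <= num_workers <= len(b)."""
    n = len(b)
    bounds = [(p * n // num_workers, (p + 1) * n // num_workers)
              for p in range(num_workers)]
    workers = []
    for p, (lo, hi) in enumerate(bounds):
        pred_width = bounds[p - 1][1] - bounds[p - 1][0] if p > 0 else 0
        workers.append(Worker(lo, hi, pred_width))

    for i, ch in enumerate(a):
        msg = None  # (boundary value, piggybacked copy of sender's row segment)
        for p, w in enumerate(workers):
            if p == 0:
                left_cur = left_prev = 0
            else:
                left_cur, segment = msg
                left_prev = w.backup[-1]  # neighbor's boundary from the previous row
                w.backup = segment        # retain neighbor's critical data
            boundary = w.step(ch, b, left_cur, left_prev)
            msg = (boundary, list(w.row))

        for (row, p) in failures:
            if row != i:
                continue
            lo, hi = bounds[p]
            fresh = Worker(lo, hi, 0)                   # replacement process, empty state
            fresh.row = list(workers[p + 1].backup)     # restore from the right neighbor
            if p > 0:
                fresh.backup = list(workers[p - 1].row) # left neighbor resends its segment
            workers[p] = fresh

    return workers[-1].row[-1]
```

In this simulation a recovered run can be compared against `lcs_reference` to confirm that restoring from the neighbor's copy reproduces the failure-free result. The piggybacked payload here is a whole row segment, which is a simplification; which data actually needs saving is exactly what the paper attributes to the application's algorithmic characteristics.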