A Distributed Scheme for Fault-Tolerance in Large Clusters of Workstations


A Duarte, D Rexachs, E Luque - PARCO, 2005 - juser.fz-juelich.de



Fault tolerance is a critical concern for long-running parallel-distributed applications executing in Massive Clusters of Workstations (MCOW). From the user's point of view, a parallel application should run and finish correctly, but users rarely want to worry about including fault tolerance capabilities in their algorithms because of the software engineering costs. The cluster's administrators, in turn, request that the fault tolerance scheme consume as few resources as possible. Because increasing the number of nodes reduces the MTBF, long-running applications demand a fault tolerance scheme that is independent of the cluster's scalability.

The fault tolerance scheme could rely on fault-tolerant hardware; however, such a solution is expensive in practice. An alternative would be to develop fault-tolerant algorithms. However, such a solution demands a large software engineering effort, and it cannot be applied to algorithms that are already coded. A third possibility is to build a software layer between the application and the system in order to isolate the faults from the application. This is an interesting alternative since it does not require any special hardware, and it makes the fault tolerance scheme fully transparent to the user.

Rollback-recovery is the classical method used when it is necessary to offer fault tolerance for a long-running parallel-distributed application based on message passing in a cluster [7]. Rollback-recovery techniques can be implemented by a software layer, and are divided into two broad categories: those based only on checkpoints and those based on both checkpoints and message logs. The former take checkpoints of individual processes, combined with a checkpoint synchronization scheme that ensures the system rolls back to and recovers from a consistent global state after a failure. The latter also keep a message log for each individual process, which allows the system to recover from a point later than the last consistent global state [5]. The main difference between these techniques is expressed by two parameters: a) the runtime overhead during fault-free execution and b) the fault penalty. The overhead directly relates to the efficiency of the rollback-recovery scheme in the absence of faults, and it is strongly related to the cluster resources consumed by the rollback-recovery protocol [4]. For example, from the user's point of view, a fault tolerance scheme that uses message logging affects message transmission latency. Furthermore, both checkpoints and logs require additional storage resources, and such resources affect the cluster architecture.
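The contrast between the two rollback-recovery categories, and the trade-off between fault-free overhead and fault penalty, can be illustrated with a small single-process toy model. This is only a sketch under stated assumptions, not the scheme proposed in the paper: the `Process` class, the `run` helper, and all parameter names are hypothetical, the process is assumed to be deterministic, and the checkpoint and log are assumed to survive the failure (as if kept on stable storage). Real protocols must also coordinate several processes to reach a consistent global state, which this sketch ignores.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Toy deterministic process: its state is the running sum of delivered messages."""
    state: int = 0
    delivered: int = 0                        # messages applied so far
    checkpoint: tuple = (0, 0)                # (state, delivered) saved at the last checkpoint
    log: list = field(default_factory=list)   # messages logged since the last checkpoint

    def deliver(self, msg, use_log):
        if use_log:
            # Fault-free overhead: every message pays the logging cost.
            self.log.append(msg)
        self.state += msg
        self.delivered += 1

    def take_checkpoint(self):
        self.checkpoint = (self.state, self.delivered)
        self.log.clear()  # messages before the checkpoint are no longer needed

    def recover(self):
        """Roll back to the last checkpoint, then replay any logged messages."""
        self.state, self.delivered = self.checkpoint
        for msg in self.log:
            self.state += msg
            self.delivered += 1
        return self.delivered


def run(messages, checkpoint_every, fail_after, use_log):
    """Deliver messages until a simulated failure, recover, and report the fault penalty.

    Returns (recovered_state, lost), where `lost` counts the messages whose
    effect was not restored and would have to be recomputed by re-execution.
    """
    p = Process()
    for i, msg in enumerate(messages[:fail_after], start=1):
        p.deliver(msg, use_log)
        if i % checkpoint_every == 0:
            p.take_checkpoint()
    recovered = p.recover()
    return p.state, fail_after - recovered


msgs = list(range(1, 11))
print(run(msgs, checkpoint_every=4, fail_after=7, use_log=False))  # (10, 3)
print(run(msgs, checkpoint_every=4, fail_after=7, use_log=True))   # (28, 0)
```

With checkpoints only, the process falls back to the state saved after the fourth message and loses the work of three messages. With the log, it replays those three messages and resumes exactly where it failed, at the price of an extra logging step on every delivery.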


Cited by 15 · 6 versions
