Dynamic fault tolerance in fat trees

FO Sem-Jacobsen, T Skeie, O Lysne… - IEEE Transactions on …, 2010 - ieeexplore.ieee.org
IEEE Transactions on Computers, 2010
Fat trees are a very common communication architecture in current large-scale parallel computers. The probability of failure in these systems increases with the number of components. We present a routing method for deterministically and adaptively routed fat trees, applicable to both distributed and source routing, that handles several concurrent faults and transparently returns to the original routing strategy once the faulty components have recovered. The method is local and dynamic, completely masking the fault from the rest of the system. It requires only a small amount of extra functionality in the switches to reroute packets around a fault. The method guarantees connectedness, deadlock freedom, and livelock freedom for up to k - 1 benign simultaneous switch and/or link faults, where k is half the number of ports in the switches. Our simulation experiments show a graceful degradation of performance as more faults occur. Furthermore, we demonstrate that for most fault combinations, our method can, with high probability, handle significantly more faults than the k - 1 limit.
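As a small worked illustration of the k - 1 bound stated in the abstract, the sketch below computes the guaranteed number of tolerated faults from a switch's port count. The function name and the example port counts are assumptions chosen for illustration; they do not come from the paper, and the code covers only the arithmetic of the bound, not the rerouting method itself.

```python
def guaranteed_fault_tolerance(ports_per_switch: int) -> int:
    """Return k - 1, where k is half the number of ports per switch.

    Hypothetical helper: it only restates the bound from the abstract.
    """
    if ports_per_switch < 2 or ports_per_switch % 2 != 0:
        raise ValueError("expected an even port count of at least 2")
    k = ports_per_switch // 2  # k is half the number of switch ports
    return k - 1


# Example port counts are assumptions, not values from the paper.
for ports in (8, 16, 36):
    print(f"{ports}-port switches: up to {guaranteed_fault_tolerance(ports)} faults guaranteed")
```

For example, with 8-port switches k is 4, so the guarantee covers up to 3 simultaneous benign faults; the abstract notes that more faults can often be handled in practice.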