Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. There are studies that address fail-stop …
We study the usefulness of partial redundancy in HPC message passing systems where individual node failure distributions are not identical. Prior research works on fault tolerance …
Future high-performance computing (HPC) systems with ever-increasing resource capacity (such as compute cores, memory and storage) may significantly increase the risks on …
Silent data corruption (SDC) and fail-stop errors are the most hazardous error types in high-performance computing (HPC) systems. In this study, we present an automatic, efficient and …
N Gupta, JR Mayo, AS Lemoine… - 2020 IEEE/ACM 10th …, 2020 - ieeexplore.ieee.org
Exceptions and errors occurring within mission critical applications due to hardware failures have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware …
N Gupta, JR Mayo, AS Lemoine, H Kaiser - arXiv preprint arXiv …, 2020 - arxiv.org
Exceptions and errors occurring within mission critical applications due to hardware failures have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware …
This work is based on the seminar titled "Resiliency in Numerical Algorithm Design for Extreme Scale Simulations" held March 1-6, 2020 at Schloss Dagstuhl, that was attended by …
Upcoming Extreme Scale, or Exascale, Computing Systems are expected to deliver a peak performance of at least 10^18 floating point operations per second (FLOPS), primarily …
O Abu-Sharkh, AH Tewfik - … and Computing (ITCC'05)-Volume II, 2005 - ieeexplore.ieee.org
This paper introduces a new scheduling scheme that provides fair access to all stations in 802.11b WLANs. The scheme divides the transmission opportunities between wireless …