High performance computing will probably reach exascale in this decade. At this scale, mean time between failures is expected to be a few hours. Existing fault tolerant protocols for …
Based on our current expectation for the exascale systems, composed of hundred of thousands of many-core nodes, the mean time between failures will become small, even …
Fault tolerance is becoming a major concern in HPC systems. The two traditional approaches for message passing applications, coordinated checkpointing and message …
This paper introduces an encapsulated sensor node that is devised to monitor and record motion patterns over long, quotidian periods of time with potential application in …
M Morán, J Balladini, D Rexachs… - 2020 IEEE Congreso …, 2020 - ieeexplore.ieee.org
High-performance computing continues to increase its computing power and energy efficiency. However, energy consumption continues to rise and finding ways to limit and/or …
The increasing failure rate in High Performance Computing encourages the investigation of fault tolerance mechanisms to guarantee the execution of an application in spite of node …
L Zhang, Z Wang, F Kong - 2022 IEEE Real-Time Systems …, 2022 - ieeexplore.ieee.org
This paper proposes an optimal checkpoint scheme for fault resilience in real-time systems, in which we consider both logical consistency and timing correctness. First, we partition …
With the growing scale of HPC applications, there has been an increase in the number of interruptions as a consequence of hardware failures. The remarkable decrease of Mean …
With our current expectation for the exascale systems, composed of hundred of thousands of many‐core nodes, the mean time between failures will become small, even under the most …