Applications structured as Directed Acyclic Graphs (DAGs) of tasks occur in many domains, including popular scientific workflows. DAG scheduling has thus received an …
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience techniques must accommodate both error sources. To cope with the double challenge, a two …
Over the past few decades, data volume has increased exponentially. Smart devices, social media, and e-business generate an extremely large amount of data every day. While big data is promising …
ME Ozturk, M Renardy, Y Li, G Agrawal… - 2018 IEEE 25th …, 2018 - ieeexplore.ieee.org
Soft errors or bit flips have recently become an important challenge in high performance computing. In this paper, we focus on soft errors in a particular algorithm: conjugate …
Fail-stop and silent errors are unavoidable on large-scale platforms. Efficient resilience techniques must accommodate both error sources. A traditional checkpointing and rollback …
N Karystinos, O Chatzopoulos, GM Fragkoulis… - users.uoa.gr
Several hyperscalers have recently disclosed the occurrence of Silent Data Corruptions (SDCs) in their systems fleets, sparking concerns about the severity of known and the …
AP Hinojosa, HJ Bungartz, D Pflüger - Sparse Grids and Applications …, 2018 - Springer
In this paper we show how to benefit from the numerical properties of a well-established extrapolation method—the combination technique—to make it tolerant to silent data …
As high performance computing (HPC) systems continue to grow, their fault rate increases. Applications running on these systems have to deal with fault rates on the order of hours or days …
EA Krluku, M Gusev, V Zdraveski - … of the 9th Balkan Conference on …, 2019 - dl.acm.org
This paper proposes a continuous health-check approach for detecting Silent Data Corruption (SDC) in High Performance Computing (HPC) environments. The goal is to …