This paper presents Phoenix, a communication and synchronization substrate that implements a novel protocol for recovering from fail-stop faults when executing graph …
X Tang, J Zhai, B Yu, W Chen… - IEEE Transactions on …, 2017 - ieeexplore.ieee.org
Fault tolerance is increasingly important in high-performance computing due to the substantial growth of system scale and decreasing system reliability. In-memory/diskless …
JD Zhai, WG Chen - Frontiers of Information Technology & Electronic …, 2018 - Springer
Exascale systems have been under development for quite some time and will be available for use in a few years. It is time to think about future post-exascale systems. There are many …
S Yang, K Wu, Y Qiao, D Li… - 2017 IEEE International …, 2017 - ieeexplore.ieee.org
Fault tolerance is one of the major design goals for HPC. The emergence of non-volatile memories (NVM) provides a solution to build fault tolerant HPC. Data in NVM-based main …
HPC systems continue to scale by including more hardware components for supporting larger application deployments. Critically, this scaling tends to decrease the mean time …
The first Petascale supercomputer, the IBM Roadrunner, went online in 2008. Ten years later, the community is now looking ahead to a new generation of Exascale machines …
Y Huang, K Wu, D Li - arXiv preprint arXiv:1705.00264, 2017 - arxiv.org
Resilience is a major design goal for HPC. Checkpoint is the most common method to enable resilient HPC. Checkpoint periodically saves critical data objects to non-volatile …
As the computing power of new supercomputers grows, their reliability inexorably declines. Indeed, the limits are being pushed back by increasing the …
N Wang, Q Sun, Y Liu, D Qian - 2018 IEEE 20th International …, 2018 - ieeexplore.ieee.org
Checkpointing is the most widely used technique in high performance computing systems to tolerate fail-stop errors and ensure reliable execution of parallel applications. However, with …