Optimizing checkpoint intervals for reduced energy use in exascale systems

D Dauwe, R Jhaveri, S Pasricha… - 2017 Eighth …, 2017 - ieeexplore.ieee.org
In today's high performance computing (HPC) systems, the probability of applications
experiencing failures has increased significantly with the increase in the number of system …

Resilience-aware resource management for exascale computing systems

D Dauwe, S Pasricha, AA Maciejewski… - IEEE Transactions on …, 2018 - ieeexplore.ieee.org
With the increases in complexity and number of nodes in large-scale high performance
computing (HPC) systems over time, the probability of applications experiencing runtime …