J Zhao, Y Xiang, T Lan, HH Huang… - IEEE Transactions on …, 2016 - ieeexplore.ieee.org
Modern-day data centers coordinate hundreds of thousands of heterogeneous tasks and aim at delivering highly reliable cloud computing services. Although offering equal reliability …
N El-Sayed, B Schroeder - 2014 IEEE International Conference …, 2014 - ieeexplore.ieee.org
As the scale of high-performance computing (HPC) clusters continues to grow, their increasing failure rates and energy consumption levels are emerging as two serious design …
Checkpoint-restart is a predominantly used reactive fault-tolerance mechanism for applications running on HPC systems. While there are innumerable studies in the literature that …
N El-Sayed, B Schroeder - IEEE Transactions on Dependable …, 2016 - ieeexplore.ieee.org
As the scale of High-Performance Computing (HPC) clusters continues to grow, their increasing failure rates and energy consumption levels are emerging as serious design …
Z Miao, JC Calhoun, R Ge - Parallel Computing, 2025 - Elsevier
Exascale computing must simultaneously address both energy efficiency and resilience as power limits impact scalability and faults are more common. Unfortunately, energy efficiency …
The fault tolerance method most used today in high-performance computing (HPC) is coordinated checkpointing. This, like any other fault tolerance method, adds additional …
Measuring and controlling the power and energy consumption of high-performance computing systems by various components in the software stack is an active research area …
Checkpointing is a fault-tolerance mechanism commonly used in High Throughput Computing (HTC) environments to allow the execution of long-running computational tasks …
Z Miao, J Calhoun, R Ge - 2018 IEEE International Conference …, 2018 - ieeexplore.ieee.org
Exascale computing must simultaneously address both energy efficiency and resilience as power limits impact scalability and faults are more common. Unfortunately, energy efficiency …