Resilience is an imminent issue for next-generation platforms due to projected increases in soft/transient failures as part of the inherent trade-offs among performance, energy, and …
N Gupta, JR Mayo, AS Lemoine, H Kaiser - arXiv preprint arXiv …, 2020 - arxiv.org
Exceptions and errors occurring within mission critical applications due to hardware failures have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware …
SR Paul, A Hayashi, M Whitlock, S Bak… - 2020 Workshop on …, 2020 - ieeexplore.ieee.org
Achieving fault tolerance is one of the significant challenges of exascale computing due to projected increases in soft/transient failures. While past work on software-based resilience …
Abstract High-Performance Computing (HPC) enables solving complex problems from various scientific fields including key societal problems such as COVID-19. Recently …
Accelerators make the task of building systems that are re-silient against transient errors like voltage noise and soft errors hard. Architects integrate accelerators into the system as black …
N Morales, K Teranishi, B Nicolae, C Trott… - Euro-Par 2021: Parallel …, 2021 - Springer
In the drive towards Exascale, the extreme heterogeneity of supercomputers at all levels places a major development burden on HPC applications. To this end, performance portable …
Leveraging a Task-based Asynchronous Dataflow Substrate for Efficient and Scalable Resiliency Page 1 Leveraging a Task-based Asynchronous Dataflow Substrate for Efficient and …
The manycore revolution is steadily increasing the performance and size of massively parallel systems, to the point where system reliability becomes a pressing concern …
Extreme-scale computing will bring significant changes to high performance computing system architectures. In particular, the increased number of system components is creating a …