Use cases of lossy compression for floating-point data in scientific data sets

F Cappello, S Di, S Li, X Liang… - … Journal of High …, 2019 - journals.sagepub.com
Architectural and technological trends of systems used for scientific computing call for a
significant reduction of scientific data sets that are composed mainly of floating-point data …

Toward exascale resilience: 2014 update

F Cappello, A Geist, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …

Fast error-bounded lossy HPC data compression with SZ

S Di, F Cappello - 2016 IEEE International Parallel and …, 2016 - ieeexplore.ieee.org
Today's HPC applications are producing extremely large amounts of data, thus it is
necessary to use an efficient compression before storing them to parallel file systems. In this …

Addressing failures in exascale computing

M Snir, RW Wisniewski, JA Abraham… - … Journal of High …, 2014 - journals.sagepub.com
We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …

[Book] Fault tolerance techniques for high-performance computing

J Dongarra, T Herault, Y Robert - 2015 - Springer
This chapter provides an introduction to resilience methods. The emphasis is on
checkpointing, the de-facto standard technique for resilience in High Performance …

A 5D gyrokinetic full-f global semi-Lagrangian code for flux-driven ion turbulence simulations

V Grandgirard, J Abiteboul, J Bigot… - Computer physics …, 2016 - Elsevier
This paper addresses non-linear gyrokinetic simulations of ion temperature gradient (ITG)
turbulence in tokamak plasmas. The electrostatic GYSELA code is one of the few …

Toward a smart cloud: A review of fault-tolerance methods in cloud systems

MA Mukwevho, T Celik - IEEE Transactions on Services …, 2018 - ieeexplore.ieee.org
This paper presents a comprehensive survey of the state-of-the-art work on fault tolerance
methods proposed for cloud computing. The survey classifies fault-tolerance methods into …

Active Flash: Towards Energy-Efficient, In-Situ Data Analytics on Extreme-Scale Machines

D Tiwari, S Boboila, S Vazhkudai, Y Kim, X Ma… - … USENIX Conference on …, 2013 - usenix.org
Modern scientific discovery is increasingly driven by large-scale supercomputing
simulations, followed by data analysis tasks. These data analyses are either performed …

VeloC: Towards high performance adaptive asynchronous checkpointing at large scale

B Nicolae, A Moody, E Gonsiorowski… - 2019 IEEE …, 2019 - ieeexplore.ieee.org
Global checkpointing to external storage (e.g., a parallel file system) is a common I/O pattern
of many HPC applications. However, given the limited I/O throughput of external storage …

Fault prediction under the microscope: A closer look into HPC systems

A Gainaru, F Cappello, M Snir… - SC'12: Proceedings of …, 2012 - ieeexplore.ieee.org
A large percentage of computing capacity in today's large high-performance computing
systems is wasted because of failures. Consequently, current research is focusing on …