Predictive reliability and fault management in exascale systems: State of the art and perspectives

R Canal, C Hernandez, R Tornero, A Cilardo… - ACM Computing …, 2020 - dl.acm.org
Performance and power constraints come together with Complementary Metal Oxide
Semiconductor technology scaling in future Exascale systems. Technology scaling makes …

Job migration in hpc clusters by means of checkpoint/restart

M Rodríguez-Pascual, J Cao, JA Moríñigo… - The Journal of …, 2019 - Springer
Until now, jobs running on HPC clusters were tied to the node where their execution started.
We have removed that limitation by integrating a user-level checkpoint/restart library into a …

Fault-tolerant in embedded systems (MPSoC): Performance estimation and dynamic migration tasks

K Smiri, S Bekri, H Smei - 2016 11th International Design & Test …, 2016 - ieeexplore.ieee.org
Multiprocessor Systems-on-Chip (MPSoC) allow the implementation of heterogeneous
architectures with a high integration capacity. In recent years, computational requirements …

ExaMig matrix: Process migration based on matrix definition of selecting destination in distributed exascale environments

EM Khaneghah, AR ShowkatAbad… - Azerbaijan Journal of …, 2018 - 82.194.3.83
In traditional computing system, load balancer, interim selecting the process, determine the
destination computing node based on describing Indicators process status. In distributed …

A Comparative Assessment of Machine Learning Models For Predicting Wind Speed

N Atashfaraz, F Gholamrezaie, A Hosseini… - Azerbaijan Journal of …, 2022 - 82.194.3.83
Renewable energy is one of the most critical issues of continuously increasing electricity
consumption which is becoming a desirable alternative to traditional methods of electricity …

[BOOK][B] Fault tolerance configuration and management for HPC applications using RADIC architecture

JL Villamayor Leguizamón - 2019 - ddd.uab.cat
High-performance computing (HPC) systems continue to grow
exponentially in terms of the number and density of components in order to achieve greater …

A Formal Approach to implement java exceptions in cooperative systems

S Hanazumi, ACV de Melo - Journal of Systems and Software, 2017 - Elsevier
The increasing number of systems that work on the top of cooperating elements have
required new techniques to control cooperation on both normal and abnormal behaviors of …

Enabling and exploiting process-level task migration in Open MPI with BarbequeRTRM

F REGHENZANI - 2015 - politesi.polimi.it
The High Performance Computing (HPC) systems typically include a large number
of computing resources - CPUs, GPUs, etc. As a consequence, we must face with the problem …