A fault-tolerant framework for asynchronous iterative computations in cloud environments

Z Wang, L Gao, Y Gu, Y Bao, G Yu - … of the Seventh ACM Symposium on …, 2016 - dl.acm.org
Many graph algorithms are iterative in nature and can be supported by distributed memory-
based systems in a synchronous manner. However, an asynchronous model has been …

Spatial support vector regression to detect silent errors in the exascale era

O Subasi, S Di, L Bautista-Gomez… - 2016 16th IEEE/ACM …, 2016 - ieeexplore.ieee.org
As the exascale era approaches, the increasing capacity of high-performance computing
(HPC) systems with targeted power and energy budget goals introduces significant …

ViTS: video tagging system from massive web multimedia collections

D Fernández, D Varas, J Espadaler… - Proceedings of the …, 2017 - openaccess.thecvf.com
The popularization of multimedia content on the Web has raised the need to automatically
understand, index and retrieve it. In this paper we present ViTS, an automatic Video Tagging …

Exploring the capabilities of support vector machines in detecting silent data corruptions

O Subasi, S Di, L Bautista-Gomez… - … Informatics and Systems, 2018 - Elsevier
As the exascale era approaches, the increasing capacity of high-performance computing
(HPC) systems with targeted power and energy budget goals introduces significant …

A methodology for soft errors detection and automatic recovery

D Montezanti, A De Giusti, M Naiouf… - … Conference on High …, 2017 - ieeexplore.ieee.org
Handling faults is a growing concern in HPC; higher error rates, larger detection intervals
and silent faults are expected in the future. It is projected that, in exascale systems, errors …

Soft errors detection and automatic recovery based on replication combined with different levels of checkpointing

D Montezanti, E Rucci, A De Giusti, M Naiouf… - Future Generation …, 2020 - Elsevier
Handling faults is a growing concern in HPC. In future exascale systems, it is projected that
silent undetected errors will occur several times a day, increasing the occurrence of …

Portable application-level checkpointing for hybrid MPI-OpenMP applications

N Losada, MJ Martín, G Rodríguez… - Procedia Computer …, 2016 - Elsevier
As parallel machines increase their number of processors, so does the failure rate of the
global system, thus, long-running applications will need to make use of fault tolerance …

DINO: Divergent node cloning for sustained redundancy in HPC

A Rezaei, F Mueller, P Hargrove, E Roman - Journal of Parallel and …, 2017 - Elsevier
Complexity and scale of next generation HPC systems pose significant challenges in fault
resilience methods such that contemporary checkpoint/restart (C/R) methods that address …

A runtime heuristic to selectively replicate tasks for application-specific reliability targets

O Subasi, G Yalcin, F Zyulkyarov… - 2016 IEEE …, 2016 - ieeexplore.ieee.org
In this paper we propose a runtime-based selective task replication technique for task-
parallel high performance computing applications. Our selective task replication technique is …

FPGA architecture characterization for system level performance analysis

D Densmore, A Donlin… - Proceedings of the …, 2006 - ieeexplore.ieee.org
We present a modular and scalable approach for automatically extracting actual
performance information from a set of FPGA-based architecture topologies. This information …