Fault-tolerant protocol for hybrid task-parallel message-passing applications

Z Wang, L Gao, Y Gu, Y Bao, G Yu - … of the Seventh ACM Symposium on …, 2016 - dl.acm.org

Many graph algorithms are iterative in nature and can be supported by distributed memory-
based systems in a synchronous manner. However, an asynchronous model has been …

Cited by 29 · Related articles · All 5 versions

[PDF] upc.edu

Spatial support vector regression to detect silent errors in the exascale era

O Subasi, S Di, L Bautista-Gomez… - 2016 16th IEEE/ACM …, 2016 - ieeexplore.ieee.org

As the exascale era approaches, the increasing capacity of high-performance computing
(HPC) systems with targeted power and energy budget goals introduces significant …

Cited by 31 · Related articles · All 7 versions

[PDF] thecvf.com

ViTS: video tagging system from massive web multimedia collections

D Fernández, D Varas, J Espadaler… - Proceedings of the …, 2017 - openaccess.thecvf.com

The popularization of multimedia content on the Web has given rise to the need to automatically
understand, index and retrieve it. In this paper we present ViTS, an automatic Video Tagging …

Cited by 17 · Related articles · All 9 versions

[PDF] sciencedirect.com

Exploring the capabilities of support vector machines in detecting silent data corruptions

O Subasi, S Di, L Bautista-Gomez… - … Informatics and Systems, 2018 - Elsevier

As the exascale era approaches, the increasing capacity of high-performance computing
(HPC) systems with targeted power and energy budget goals introduces significant …

Cited by 18 · Related articles · All 7 versions

[PDF] unlp.edu.ar

A methodology for soft errors detection and automatic recovery

D Montezanti, A De Giusti, M Naiouf… - … Conference on High …, 2017 - ieeexplore.ieee.org

Handling faults is a growing concern in HPC; higher error rates, larger detection intervals
and silent faults are expected in the future. It is projected that, in exascale systems, errors …

Cited by 13 · Related articles · All 6 versions

[PDF] arxiv.org

Soft errors detection and automatic recovery based on replication combined with different levels of checkpointing

D Montezanti, E Rucci, A De Giusti, M Naiouf… - Future Generation …, 2020 - Elsevier

Handling faults is a growing concern in HPC. In future exascale systems, it is projected that
silent undetected errors will occur several times a day, increasing the occurrence of …

Cited by 6 · Related articles · All 10 versions

[PDF] sciencedirect.com

Portable application-level checkpointing for hybrid MPI-OpenMP applications

N Losada, MJ Martín, G Rodríguez… - Procedia Computer …, 2016 - Elsevier

As parallel machines increase their number of processors, so does the failure rate of the
global system, thus, long-running applications will need to make use of fault tolerance …

Cited by 14 · Related articles · All 6 versions

[PDF] osti.gov

DINO: Divergent node cloning for sustained redundancy in HPC

A Rezaei, F Mueller, P Hargrove, E Roman - Journal of Parallel and …, 2017 - Elsevier

Complexity and scale of next generation HPC systems pose significant challenges in fault
resilience methods such that contemporary checkpoint/restart (C/R) methods that address …

Cited by 12 · Related articles · All 17 versions

[PDF] agu.edu.tr

A runtime heuristic to selectively replicate tasks for application-specific reliability targets

O Subasi, G Yalcin, F Zyulkyarov… - 2016 IEEE …, 2016 - ieeexplore.ieee.org

In this paper we propose a runtime-based selective task replication technique for task-
parallel high performance computing applications. Our selective task replication technique is …

Cited by 12 · Related articles · All 9 versions

[PDF] uci.edu

FPGA architecture characterization for system level performance analysis

D Densmore, A Donlin… - Proceedings of the …, 2006 - ieeexplore.ieee.org

We present a modular and scalable approach for automatically extracting actual
performance information from a set of FPGA-based architecture topologies. This information …

Cited by 28 · Related articles · All 14 versions
