Related articles - Academic resource search

A methodology for soft errors detection and automatic recovery

D Montezanti, A De Giusti, M Naiouf… - … Conference on High …, 2017 - ieeexplore.ieee.org

Handling faults is a growing concern in HPC; higher error rates, larger detection intervals
and silent faults are expected in the future. It is projected that, in exascale systems, errors …

Cited by 13 - Related articles - All 6 versions

[PDF] upc.edu

Designing and modelling selective replication for fault-tolerant hpc applications

O Subasi, G Yalcin, F Zyulkyarov… - 2017 17th IEEE/ACM …, 2017 - ieeexplore.ieee.org

Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for
High Performance Computing (HPC) applications. There are studies that address fail-stop …

Cited by 38 - Related articles - All 8 versions

[PDF] unlp.edu.ar

Characterizing a detection strategy for transient faults in hpc

DM Montezanti, D Rexachs del Rosario, E Rucci… - 2016 - sedici.unlp.edu.ar

Handling faults is a growing concern in HPC; greater varieties, higher error rates, larger
detection intervals and silent faults are expected in the future. It is projected that, in exascale …

Cited by 4 - Related articles - All 4 versions

[PDF] ethz.ch

Assessing HPC failure detectors for MPI jobs

K Kharbas, D Kim, T Hoefler… - 2012 20th Euromicro …, 2012 - ieeexplore.ieee.org

Reliability is one of the challenges faced by exascale computing. Components are poised to
fail during large-scale executions given current mean time between failure (MTBF) …

Cited by 15 - Related articles - All 24 versions

[PDF] arxiv.org

Soft errors detection and automatic recovery based on replication combined with different levels of checkpointing

D Montezanti, E Rucci, A De Giusti, M Naiouf… - Future Generation …, 2020 - Elsevier

Handling faults is a growing concern in HPC. In future exascale systems, it is projected that
silent undetected errors will occur several times a day, increasing the occurrence of …

Cited by 6 - Related articles - All 10 versions

[PDF] ncsu.edu

Combining partial redundancy and checkpointing for HPC

J Elliott, K Kharbas, D Fiala, F Mueller… - 2012 IEEE 32nd …, 2012 - ieeexplore.ieee.org

Today's largest High Performance Computing (HPC) systems exceed one Petaflops (10^15
floating point operations per second) and exascale systems are projected within seven …

Cited by 203 - Related articles - All 20 versions

Opportunistic application-level fault detection through adaptive redundant multithreading

S Hukerikar, PC Diniz, RF Lucas… - … Conference on High …, 2014 - ieeexplore.ieee.org

As the scale and complexity of future High Performance Computing systems continues to
grow, the rising frequency of faults and errors and their impact on HPC applications will …

Cited by 27 - Related articles - All 2 versions

[PDF] semanticscholar.org

A case study of designing efficient algorithm-based fault tolerant application for exascale parallelism

E Yao, R Wang, M Chen, G Tan… - 2012 IEEE 26th …, 2012 - ieeexplore.ieee.org

Fault tolerance overhead of high performance computing (HPC) applications is becoming
critical to the efficient utilization of HPC systems at large scale. Today's HPC applications …

Cited by 15 - Related articles - All 6 versions

Evaluating the error resilience of parallel programs

B Fang, K Pattabiraman, M Ripeanu… - 2014 44th Annual …, 2014 - ieeexplore.ieee.org

As a consequence of increasing hardware fault rates, HPC systems face significant
challenges in terms of reliability. Evaluating the error resilience of HPC applications is an …

Cited by 16 - Related articles - All 3 versions

[PDF] ucf.edu

Understanding the propagation of transient errors in HPC applications

RA Ashraf, R Gioiosa, G Kestor, RF DeMara… - Proceedings of the …, 2015 - dl.acm.org

Resiliency of exascale systems has quickly become an important concern for the scientific
community. Despite its importance, still much remains to be determined regarding how faults …

Cited by 112 - Related articles - All 8 versions
