[PDF] International Journal of Networking and Computing – www.ijnc.org, ISSN 2185-2847, Volume X, Number Y, pages 1–26, January 20XX

A Benoit, A Cavelan, FM Ciorba, V Le Fevre, Y Robert - icl.utk.edu
Large-scale platforms currently experience errors from two different sources, namely
fail-stop errors (which interrupt the execution) and silent errors (which strike unnoticed and …

Mapping High Level Parallel Programming Models to Asynchronous Many-Task (AMT) Runtimes

SR Paul - 2019 - search.proquest.com
Abstract Asynchronous Many-Task (AMT) runtimes have recently been proposed as a
promising software foundation for managing the increasing complexity of node architectures …

Quantification, Trade-off Analysis, and Optimal Checkpoint Placement for Reliability and Availability

O Subasi, R Tipireddy… - 2018 IEEE 25th …, 2018 - ieeexplore.ieee.org
Checkpointing is the most widely used technique in high-performance computing (HPC) to
ensure the application progress in the presence of failures. In this paper, we present …

EasyChoose: A Continuous Feature Extraction and Review Highlighting Scheme on Hadoop YARN

MC Lee, JC Lin, O Owe - 2018 IEEE 32nd International …, 2018 - ieeexplore.ieee.org
Today the Internet offers a massive amount of reviews and user experiences about a variety
of products from different manufacturers, ranging from smartphones, automobiles, and home …

Comparative analysis of soft-error detection strategies: a case study with iterative methods

G Kestor Gioiosa, B Ozcelik Mutlu, J Manzano… - 2018 - osti.gov
Undetected soft errors caused by transient bit flips can lead to silent data corruption (SDC),
an undesirable outcome where invalid results pass for valid ones. This has motivated the …

Resilient scheduling algorithms for large-scale platforms

V Le Fèvre - 2020 - theses.hal.science
This thesis focuses on a major problem for the HPC community: resilience. Computing
platforms are bigger and bigger in order to reach what we call exascale, i.e. a computing …

A QEMU-Based Fault Injector for Injecting Memory-Access-Pattern-Dependent Faults

小林佑矢, 實本英之, 野村哲弘… - Research Reports: High Performance …, 2017 - ipsj.ixsq.nii.ac.jp
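Abstract As parallel computers grow in scale, there is concern that Silent Data Corruption (SDC)
will reduce reliability. SDC is a fault that is difficult to detect, and dealing with it is costly. To construct and select an appropriate method …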

[CITATION] Resilient scheduling algorithms for large-scale platforms

O Beaumont - 2020 - University of Pittsburgh, Reviewer …