Related articles - Academic resource search

UnSync: A soft error resilient redundant multicore architecture

R Jeyapaul, F Hong, A Rhisheekesan… - 2011 International …, 2011 - ieeexplore.ieee.org

Reducing device dimensions, increasing transistor densities, and smaller timing windows
expose the vulnerability of processors to soft errors induced by charge carrying particles …

Cited by 11 - Related articles - All 10 versions

[PDF] illinois.edu

Understanding the propagation of error due to a silent data corruption in a sparse matrix vector multiply

J Calhoun, M Snir, L Olson… - 2015 IEEE International …, 2015 - ieeexplore.ieee.org

With the rate of errors that silently affect an application's state/output expected to increase in
future HPC machines, numerous mitigation schemes have been proposed, but little work …

Cited by 13 - Related articles - All 10 versions

[PDF] ncsu.edu

Systemic assessment of node failures in HPC production platforms

A Das, F Mueller, B Rountree - 2021 IEEE International Parallel …, 2021 - ieeexplore.ieee.org

Production HPC clusters endure failures reducing computational capability and resource
availability. Despite the presence of various failure prediction schemes for large-scale …

Cited by 8 - Related articles - All 5 versions

A case for adaptive redundancy for HPC resilience

S Hukerikar, PC Diniz, RF Lucas - European Conference on Parallel …, 2013 - Springer

Redundancy both in space and time has been widely used to detect and in some cases
correct errors in High Performance Computing (HPC) systems. With the HPC community …

Cited by 9 - Related articles

Experiences with a private enterprise cloud: Providing fault tolerance and high availability for interactive EDA applications

V Kamath, R Giri, R Muralidhar - 2013 IEEE Sixth International …, 2013 - ieeexplore.ieee.org

Silicon Design and Electronic Design Automation (EDA) business is highly competitive and
time to market is of utmost importance in the semiconductor industry where companies put in …

Cited by 12 - Related articles - All 4 versions

[PDF] researchgate.net

Evaluation of simple causal message logging for large-scale fault tolerant HPC systems

E Meneses, G Bronevetsky… - 2011 IEEE International …, 2011 - ieeexplore.ieee.org

The era of petascale computing brought machines with hundreds of thousands of
processors. The next generation of exascale supercomputers will make available clusters …

Cited by 30 - Related articles - All 11 versions

[PDF] osti.gov

Using probabilistic characterization to reduce runtime faults in HPC systems

J Brandt, B Debusschere, A Gentile… - 2008 Eighth IEEE …, 2008 - ieeexplore.ieee.org

The current trend in high performance computing is to aggregate ever larger numbers of
processing and interconnection elements in order to achieve desired levels of computational …

Cited by 24 - Related articles - All 7 versions

Fault injection experiments with the CLAMR hydrodynamics mini-app

B Atkinson, N Debardeleben, Q Guan… - 2014 IEEE …, 2014 - ieeexplore.ieee.org

In this paper, we present a resilience analysis of the impact of soft errors on CLAMR, a
hydrodynamics mini-app for high performance computing (HPC). We utilize F-SEFI, a fine …

Cited by 14 - Related articles - All 5 versions

[PDF] semanticscholar.org

Active replication at (almost) no cost

A Martin, C Fetzer, A Brito - 2011 IEEE 30th International …, 2011 - ieeexplore.ieee.org

MapReduce has become a popular programming paradigm in the domain of batch
processing systems. Its simplicity allows applications to be highly scalable and to be easily …

Cited by 49 - Related articles - All 6 versions

[PDF] arxiv.org

Algorithm-directed crash consistence in non-volatile memory for HPC

S Yang, K Wu, Y Qiao, D Li… - 2017 IEEE International …, 2017 - ieeexplore.ieee.org

Fault tolerance is one of the major design goals for HPC. The emergence of non-volatile
memories (NVM) provides a solution to build fault tolerant HPC. Data in NVM-based main …

Cited by 14 - Related articles - All 7 versions
