OVIS-2: A robust distributed architecture for scalable RAS

The Zoltan and Isorropia parallel toolkits for combinatorial scientific computing: Partitioning, ordering and coloring

EG Boman, ÜV Çatalyürek, C Chevalier… - Scientific …, 2012 - content.iospress.com

Partitioning and load balancing are important problems in scientific computing that can be
modeled as combinatorial problems using graphs or hypergraphs. The Zoltan toolkit was …

Cited by 212 · Related articles · All 11 versions

[PDF] researchgate.net

Proactive fault tolerance using preemptive migration

C Engelmann, GR Vallee, T Naughton… - 2009 17th Euromicro …, 2009 - ieeexplore.ieee.org

Proactive fault tolerance (FT) in high-performance computing is a concept that prevents
compute node failures from impacting running parallel applications by preemptively …

Cited by 153 · Related articles · All 17 versions

[PDF] nsf.gov

Comprehensive resource use monitoring for HPC systems with TACC stats

T Evans, WL Barth, JC Browne… - … Workshop on HPC …, 2014 - ieeexplore.ieee.org

This paper reports on a comprehensive, fully automated resource use monitoring package,
TACC Stats, which enables both consultants, users and other stakeholders in an HPC …

Cited by 95 · Related articles · All 9 versions

[PDF] susu.ru

A review of supercomputer performance monitoring systems

KS Stefanov, S Pawar, A Ranjan… - Supercomputing …, 2021 - superfri.susu.ru

Abstract High Performance Computing is now one of the emerging fields in computer
science and its applications. Top HPC facilities, supercomputers, offer great opportunities in …

Cited by 9 · Related articles · All 4 versions

[PDF] osti.gov

Resource monitoring and management with OVIS to enable HPC in cloud computing environments

J Brandt, A Gentile, J Mayo, P Pebay… - … on Parallel & …, 2009 - ieeexplore.ieee.org

Using the cloud computing paradigm, a host of companies promise to make huge compute
resources available to users on a pay-as-you-go basis. These resources can be configured …

Cited by 100 · Related articles · All 10 versions

[PDF] nsf.gov

Monster: an out-of-the-box monitoring tool for high performance computing systems

J Li, G Ali, N Nguyen, J Hass, A Sill… - … on Cluster Computing …, 2020 - ieeexplore.ieee.org

Understanding the status of high-performance computing platforms and correlating
applications to resource usage provide insight into the interactions among platform …

Cited by 23 · Related articles · All 3 versions

[PDF] psu.edu

Monitoring and predicting hardware failures in HPC clusters with FTB-IPMI

R Rajachandrasekar, X Besseron… - 2012 IEEE 26th …, 2012 - ieeexplore.ieee.org

Fault-detection and prediction in HPC clusters and Cloud-computing systems are
increasingly challenging issues. Several system middleware such as job schedulers and …

Cited by 44 · Related articles · All 7 versions

[PDF] psu.edu

Window-based, discontinuity preserving stereo

M Agrawal, LS Davis - Proceedings of the 2004 IEEE Computer …, 2004 - ieeexplore.ieee.org

Traditionally, the problem of stereo matching has been addressed either by a local
window-based approach or a dense pixel-based approach using global optimization. In this paper …

Cited by 73 · Related articles · All 10 versions

[PDF] archive.org

Extending LDMS to enable performance monitoring in multi-core applications

S Feldman, D Zhang, D Dechev… - 2015 IEEE International …, 2015 - ieeexplore.ieee.org

Identifying design patterns that limit the performance of multi-core algorithms is a
challenging task. There are many known methods by which threads synchronize their …

Cited by 26 · Related articles · All 5 versions

Understanding application and system performance through system-wide monitoring

RT Evans, JC Browne, WL Barth - 2016 IEEE International …, 2016 - ieeexplore.ieee.org

TACC Stats is a continuous monitoring tool for HPC systems that collects data at the core
and process level for every job executing on a monitored system. That data can be …

Cited by 20 · Related articles · All 2 versions
