Fault-aware job scheduling for bluegene/l systems

A Oliner, J Stearley - 37th annual IEEE/IFIP international …, 2007 - ieeexplore.ieee.org

If we hope to automatically detect and diagnose failures in large-scale computer systems,
we must study real deployed systems and the data they generate. Progress has been …

Cited by 674 · Related articles · All 8 versions

[PDF] researchgate.net

Failure prediction in ibm bluegene/l event logs

Y Liang, Y Zhang, H Xiong… - … Conference on Data …, 2007 - ieeexplore.ieee.org

Frequent failures are becoming a serious concern to the community of high-end computing,
especially when the applications and the underlying systems rapidly grow in size and …

Cited by 332 · Related articles · All 14 versions

[PDF] ncsu.edu

Proactive fault tolerance for HPC with Xen virtualization

AB Nagarajan, F Mueller, C Engelmann… - Proceedings of the 21st …, 2007 - dl.acm.org

Large-scale parallel computing is relying increasingly on clusters with thousands of
processors. At such large counts of compute nodes, faults are becoming common place …

Cited by 527 · Related articles · All 20 versions

[PDF] rutgers.edu

Bluegene/l failure analysis and prediction models

Y Liang, Y Zhang, A Sivasubramaniam… - … and Networks (DSN' …, 2006 - ieeexplore.ieee.org

The growing computational and storage needs of several scientific applications mandate the
deployment of extreme-scale parallel machines, such as IBM's BlueGene/L which can …

Cited by 402 · Related articles · All 14 versions

[PDF] psu.edu

Proactive process-level live migration in HPC environments

C Wang, F Mueller, C Engelmann… - SC'08: Proceedings of …, 2008 - ieeexplore.ieee.org

As the number of nodes in high-performance computing environments keeps increasing,
faults are becoming common place. Reactive fault tolerance (FT) often does not scale due to …

Cited by 250 · Related articles · All 18 versions

[PDF] semanticscholar.org

Failure analysis of jobs in compute clouds: A google cluster case study

X Chen, CD Lu, K Pattabiraman - 2014 IEEE 25th International …, 2014 - ieeexplore.ieee.org

In this paper, we analyze a workload trace from the Google cloud cluster and characterize
the observed failures. The goal of our work is to improve the understanding of failures in …

Cited by 122 · Related articles · All 7 versions

[PDF] researchgate.net

[PDF] Ensemble of Bayesian predictors and decision trees for proactive failure management in cloud computing systems.

Q Guan, Z Zhang, S Fu - J. Commun., 2012 - researchgate.net

In modern cloud computing systems, hundreds and even thousands of cloud servers are
interconnected by multi-layer networks. In such large-scale and complex systems, failures …

Cited by 116 · Related articles · All 4 versions

[PDF] iit.edu

Fault-aware, utility-based job scheduling on Blue Gene/P systems

W Tang, Z Lan, N Desai… - 2009 IEEE International …, 2009 - ieeexplore.ieee.org

Job scheduling on large-scale systems is an increasingly complicated affair, with numerous
factors influencing scheduling policy. Addressing these concerns results in sophisticated …

Cited by 113 · Related articles · All 9 versions

[PDF] psu.edu

Co-analysis of RAS log and job log on Blue Gene/P

Z Zheng, L Yu, W Tang, Z Lan, R Gupta… - … parallel & distributed …, 2011 - ieeexplore.ieee.org

With the growth of system size and complexity, reliability has become of paramount
importance for petascale systems. Reliability, Availability, and Serviceability (RAS) logs …

Cited by 103 · Related articles · All 11 versions

[PDF] microsoft.com

P-packSVM: Parallel primal gradient descent kernel SVM

AZ Zeyuan, C Weizhu, W Gang… - 2009 Ninth IEEE …, 2009 - ieeexplore.ieee.org

It is an extreme challenge to produce a nonlinear SVM classifier on very large scale data. In
this paper we describe a novel P-packSVM algorithm that can solve the Support Vector …

Cited by 102 · Related articles · All 15 versions
