What supercomputers say: A study of five system logs

A Oliner, J Stearley - 37th annual IEEE/IFIP international …, 2007 - ieeexplore.ieee.org
If we hope to automatically detect and diagnose failures in large-scale computer systems,
we must study real deployed systems and the data they generate. Progress has been …

Failure prediction in IBM BlueGene/L event logs

Y Liang, Y Zhang, H Xiong… - … Conference on Data …, 2007 - ieeexplore.ieee.org
Frequent failures are becoming a serious concern to the community of high-end computing,
especially when the applications and the underlying systems rapidly grow in size and …

Proactive fault tolerance for HPC with Xen virtualization

AB Nagarajan, F Mueller, C Engelmann… - Proceedings of the 21st …, 2007 - dl.acm.org
Large-scale parallel computing is relying increasingly on clusters with thousands of
processors. At such large counts of compute nodes, faults are becoming common place …

BlueGene/L failure analysis and prediction models

Y Liang, Y Zhang, A Sivasubramaniam… - … and Networks (DSN' …, 2006 - ieeexplore.ieee.org
The growing computational and storage needs of several scientific applications mandate the
deployment of extreme-scale parallel machines, such as IBM's BlueGene/L which can …

Proactive process-level live migration in HPC environments

C Wang, F Mueller, C Engelmann… - SC'08: Proceedings of …, 2008 - ieeexplore.ieee.org
As the number of nodes in high-performance computing environments keeps increasing,
faults are becoming common place. Reactive fault tolerance (FT) often does not scale due to …

Failure analysis of jobs in compute clouds: A Google cluster case study

X Chen, CD Lu, K Pattabiraman - 2014 IEEE 25th International …, 2014 - ieeexplore.ieee.org
In this paper, we analyze a workload trace from the Google cloud cluster and characterize
the observed failures. The goal of our work is to improve the understanding of failures in …

Ensemble of Bayesian predictors and decision trees for proactive failure management in cloud computing systems

Q Guan, Z Zhang, S Fu - J. Commun., 2012 - researchgate.net
In modern cloud computing systems, hundreds and even thousands of cloud servers are
interconnected by multi-layer networks. In such large-scale and complex systems, failures …

Fault-aware, utility-based job scheduling on Blue Gene/P systems

W Tang, Z Lan, N Desai… - 2009 IEEE International …, 2009 - ieeexplore.ieee.org
Job scheduling on large-scale systems is an increasingly complicated affair, with numerous
factors influencing scheduling policy. Addressing these concerns results in sophisticated …

Co-analysis of RAS log and job log on Blue Gene/P

Z Zheng, L Yu, W Tang, Z Lan, R Gupta… - … parallel & distributed …, 2011 - ieeexplore.ieee.org
With the growth of system size and complexity, reliability has become of paramount
importance for petascale systems. Reliability, Availability, and Serviceability (RAS) logs …

P-packSVM: Parallel primal gradient descent kernel SVM

AZ Zeyuan, C Weizhu, W Gang… - 2009 Ninth IEEE …, 2009 - ieeexplore.ieee.org
It is an extreme challenge to produce a nonlinear SVM classifier on very large scale data. In
this paper we describe a novel P-packSVM algorithm that can solve the Support Vector …