Toward automated anomaly identification in large-scale systems Z Lan, Z Zheng, Y Li IEEE Transactions on Parallel and Distributed Systems 21 (2), 174-187, 2010 | 146 | 2010 |
System log pre-processing to improve failure prediction Z Zheng, Z Lan, BH Park, A Geist Dependable Systems & Networks, 2009. DSN'09. IEEE/IFIP International …, 2009 | 142 | 2009 |
Co-analysis of RAS log and job log on Blue Gene/P Z Zheng, L Yu, W Tang, Z Lan, R Gupta, N Desai, S Coghlan, D Buettner Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International …, 2011 | 103 | 2011 |
Practical online failure prediction for Blue Gene/P: Period-based vs event-driven L Yu, Z Zheng, Z Lan, S Coghlan Dependable Systems and Networks Workshops (DSN-W), 2011 IEEE/IFIP 41st …, 2011 | 93 | 2011 |
A practical failure prediction with location and lead time for Blue Gene/P Z Zheng, Z Lan, R Gupta, S Coghlan, P Beckman Dependable Systems and Networks Workshops (DSN-W), 2010 International …, 2010 | 88 | 2010 |
Dynamic meta-learning for failure prediction in large-scale systems: A case study J Gu, Z Zheng, Z Lan, J White, E Hocks, BH Park Parallel Processing, 2008. ICPP'08. 37th International Conference on, 157-164, 2008 | 81 | 2008 |
When is multi-version checkpointing needed? G Lu, Z Zheng, AA Chien Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale …, 2013 | 75 | 2013 |
A study of dynamic meta-learning for failure prediction in large-scale systems Z Lan, J Gu, Z Zheng, R Thakur, S Coghlan Journal of Parallel and Distributed Computing 70 (6), 630-643, 2010 | 74 | 2010 |
Reliability-aware scalability models for high performance computing Z Zheng, Z Lan Cluster Computing and Workshops, 2009. CLUSTER'09. IEEE International …, 2009 | 72 | 2009 |
3-Dimensional root cause diagnosis via co-analysis Z Zheng, L Yu, Z Lan, T Jones Proceedings of the 9th international conference on Autonomic computing, 181-190, 2012 | 57 | 2012 |
Versioned distributed arrays for resilience in scientific applications: Global view resilience A Chien, P Balaji, P Beckman, N Dun, A Fang, H Fujita, K Iskra, ... Procedia Computer Science 51, 29-38, 2015 | 42 | 2015 |
Anomaly localization in large-scale clusters Z Zheng, Y Li, Z Lan Cluster Computing, 2007 IEEE International Conference on, 322-330, 2007 | 37 | 2007 |
Reliability-aware speedup models for parallel applications with coordinated checkpointing/restart Z Zheng, L Yu, Z Lan IEEE Transactions on Computers 64 (5), 1402-1415, 2015 | 28 | 2015 |
Performance under failures of DAG-based parallel computing H Jin, XH Sun, Z Zheng, Z Lan, B Xie Cluster Computing and the Grid, 2009. CCGRID'09. 9th IEEE/ACM International …, 2009 | 23 | 2009 |
Fault tolerance in an inner-outer solver: a GVR-enabled case study Z Zheng, AA Chien, K Teranishi International Conference on High Performance Computing for Computational …, 2014 | 22 | 2014 |
Filtering log data: Finding the needles in the Haystack L Yu, Z Zheng, Z Lan, T Jones, JM Brandt, AC Gentile Dependable Systems and Networks (DSN), 2012 42nd Annual IEEE/IFIP …, 2012 | 22 | 2012 |
A fault diagnosis and prognosis service for teragrid clusters Z Lan, Y Li, P Gujrati, Z Zheng, R Thakur, J White Proc. of The 2nd TeraGrid Conference, 2007 | 20 | 2007 |
Exploring versioned distributed arrays for resilience in scientific applications: global view resilience A Chien, P Balaji, N Dun, A Fang, H Fujita, K Iskra, Z Rubenstein, Z Zheng, ... The International Journal of High Performance Computing Applications …, 2016 | 18 | 2016 |
Towards a fault-aware computing environment XH Sun, Z Lan, Y Li, H Jin, Z Zheng Proceedings of the High Availability and Performance Computing Workshop (HAPCW), 2008 | 16 | 2008 |
Error checking and snapshot-based recovery in a preconditioned conjugate gradient solver Z Rubenstein, H Fujita, Z Zheng, A Chien Technical Report TR-2013-11, Department of Computer Science, University of …, 2013 | 9 | 2013 |