Zeno: A straggler diagnosis system for distributed computing using machine learning

H Shen, C Li - … Computing: 33rd International Conference, ISC High …, 2018 - Springer
Abstract
Modern distributed computing frameworks for cloud computing and high performance computing typically accelerate job performance by dividing a large job into small tasks that execute in parallel. Some tasks, however, may run far behind the others, which jeopardizes the job completion time. In this paper, we present Zeno, a novel system that automatically identifies and diagnoses stragglers in jobs using machine learning methods. First, the system identifies stragglers with an unsupervised clustering method that groups tasks based on their execution time. It then uses a supervised rule learning algorithm to learn diagnosis rules that infer stragglers from the tasks' resource assignment and usage data. Zeno is evaluated on traces from Google's Borg system and Alibaba's Fuxi system. The results demonstrate that our system is able to generate simple and easy-to-read rules with both valuable insights and decent performance in predicting stragglers.
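The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not Zeno's implementation: the abstract does not name the clustering or rule learning algorithms, so this sketch assumes a 1-D 2-means split on execution time and a single-threshold rule learner as stand-ins. The task data and the feature names `cpu_usage` and `mem_usage` are hypothetical.

```python
from statistics import mean


def two_means_1d(durations, iters=50):
    """Label each task True (straggler) or False using 1-D 2-means.

    Assumption: a stand-in for the paper's unsupervised clustering step.
    """
    lo, hi = min(durations), max(durations)
    if lo == hi:
        # All tasks took the same time, so there is nothing to separate.
        return [False] * len(durations)
    for _ in range(iters):
        slow = [d for d in durations if abs(d - hi) < abs(d - lo)]
        fast = [d for d in durations if abs(d - hi) >= abs(d - lo)]
        new_lo, new_hi = mean(fast), mean(slow)
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return [abs(d - hi) < abs(d - lo) for d in durations]


def learn_rule(rows, labels):
    """Find the single threshold rule that best predicts the labels.

    Assumption: a stand-in for the paper's supervised rule learner.
    Returns (accuracy, feature, operator, threshold).
    """
    best = None
    for feat in rows[0]:
        for t in sorted({r[feat] for r in rows}):
            for op in (">", "<="):
                pred = [(r[feat] > t) if op == ">" else (r[feat] <= t)
                        for r in rows]
                acc = sum(p == y for p, y in zip(pred, labels)) / len(labels)
                if best is None or acc > best[0]:
                    best = (acc, feat, op, t)
    return best


# Hypothetical task trace: execution time in seconds plus usage features.
durations = [10, 11, 12, 10, 13, 40, 45, 11]
rows = [
    {"cpu_usage": 0.10, "mem_usage": 0.5},
    {"cpu_usage": 0.20, "mem_usage": 0.6},
    {"cpu_usage": 0.10, "mem_usage": 0.4},
    {"cpu_usage": 0.15, "mem_usage": 0.7},
    {"cpu_usage": 0.20, "mem_usage": 0.5},
    {"cpu_usage": 0.80, "mem_usage": 0.6},
    {"cpu_usage": 0.90, "mem_usage": 0.5},
    {"cpu_usage": 0.10, "mem_usage": 0.6},
]

labels = two_means_1d(durations)   # stage 1: identify stragglers
rule = learn_rule(rows, labels)    # stage 2: explain them with a rule
acc, feat, op, t = rule
print(f"IF {feat} {op} {t} THEN straggler (accuracy {acc:.2f})")
```

On this toy trace the two slow tasks form their own cluster, and the learned rule is a readable threshold on one feature, which is the kind of output the abstract describes as simple and easy to read.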