The design and operation of CloudLab

D Duplyakin, R Ricci, A Maricq, G Wong… - 2019 USENIX annual …, 2019 - usenix.org
Given the highly empirical nature of research in cloud computing, networked systems, and
related fields, testbeds play an important role in the research ecosystem. In this paper, we …

GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server

H Cui, H Zhang, GR Ganger, PB Gibbons… - Proceedings of the …, 2016 - dl.acm.org
Large-scale deep learning requires huge computational resources to train a multi-layer
neural network. Recent systems propose using 100s to 1000s of machines to train networks …
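A minimal sketch of the parameter-server pattern this paper builds on: a central store holds model parameters, and workers push gradient updates and pull current values. All names here are illustrative, not GeePS's actual (GPU-sharded) API.

```python
# Illustrative parameter-server sketch; GeePS itself specializes this
# pattern for GPU memory, which this toy version does not model.
import threading

class ParameterServer:
    """Central parameter store: workers push gradients, pull values."""
    def __init__(self, params):
        self._params = dict(params)
        self._lock = threading.Lock()

    def pull(self, key):
        with self._lock:
            return self._params[key]

    def push(self, key, gradient, lr=0.1):
        # Apply one SGD step atomically.
        with self._lock:
            self._params[key] -= lr * gradient

def worker(ps, key, grads):
    for g in grads:
        ps.push(key, g)

ps = ParameterServer({"w": 1.0})
threads = [threading.Thread(target=worker, args=(ps, "w", [0.5] * 4))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(ps.pull("w"))  # ~0.6 after 8 pushes of gradient 0.5 at lr 0.1
```

Real systems shard the key space across many server processes; the single-process lock stands in for that coordination here.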

TetriSched: global rescheduling with adaptive plan-ahead in dynamic heterogeneous clusters

A Tumanov, T Zhu, JW Park, MA Kozuch… - Proceedings of the …, 2016 - dl.acm.org
TetriSched is a scheduler that works in tandem with a calendaring reservation system to
continuously re-evaluate the immediate-term scheduling plan for all pending jobs (including …

IndexFS: Scaling file system metadata performance with stateless caching and bulk insertion

K Ren, Q Zheng, S Patil… - SC'14: Proceedings of the …, 2014 - ieeexplore.ieee.org
The growing size of modern storage systems is expected to exceed billions of objects,
making metadata scalability critical to overall performance. Many existing distributed file …

Exploiting bounded staleness to speed up big data analytics

H Cui, J Cipar, Q Ho, JK Kim, S Lee, A Kumar… - 2014 USENIX Annual …, 2014 - usenix.org
Many modern machine learning (ML) algorithms are iterative, converging on a final solution
via many iterations over the input data. This paper explores approaches to exploiting these …
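The bounded-staleness idea the snippet refers to can be stated as a one-line condition (the Stale Synchronous Parallel model): a worker may start its next iteration only if the slowest worker is at most a fixed number of iterations behind it. A simplified, single-process sketch (function and variable names are my own):

```python
# Sketch of the Stale Synchronous Parallel (SSP) progress condition:
# fresher workers may run ahead of the slowest one by at most `staleness`
# iterations, trading update freshness for less synchronization.

def can_proceed(worker_clock, all_clocks, staleness):
    """True if a worker at `worker_clock` may start its next iteration."""
    return worker_clock - min(all_clocks) <= staleness

clocks = [5, 3, 4]  # per-worker iteration counters
print(can_proceed(5, clocks, staleness=2))  # True: 5 - 3 <= 2
print(can_proceed(5, clocks, staleness=1))  # False: 5 - 3 > 1
```

Setting `staleness=0` recovers BSP (a full barrier every iteration); larger values let fast workers keep computing on slightly stale parameters.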

Addressing the straggler problem for iterative convergent parallel ML

A Harlap, H Cui, W Dai, J Wei, GR Ganger… - Proceedings of the …, 2016 - dl.acm.org
FlexRR provides a scalable, efficient solution to the straggler problem for iterative machine
learning (ML). The frequent (e.g., per-iteration) barriers used in traditional BSP-based …

On model parallelization and scheduling strategies for distributed machine learning

S Lee, JK Kim, X Zheng, Q Ho… - Advances in neural …, 2014 - proceedings.neurips.cc
Distributed machine learning has typically been approached from a data parallel
perspective, where big data are partitioned to multiple workers and an algorithm is executed …
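The contrast the snippet draws can be made concrete: data parallelism splits the training data across workers, while model parallelism splits the parameters themselves, with each worker updating only its own shard. A toy round-robin partitioner (illustrative only; the paper's STRADS scheduler chooses shards far more carefully):

```python
# Sketch of model-parallel partitioning: parameters, not data, are split
# across workers. Round-robin assignment is a naive stand-in for the
# paper's dependency-aware scheduling.

def partition_params(param_names, n_workers):
    """Round-robin assignment of model parameters to worker shards."""
    shards = [[] for _ in range(n_workers)]
    for i, name in enumerate(param_names):
        shards[i % n_workers].append(name)
    return shards

print(partition_params(["w0", "w1", "w2", "w3", "w4"], 2))
# [['w0', 'w2', 'w4'], ['w1', 'w3']]
```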

SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems

T Leesatapornwongsa, M Hao, P Joshi… - … USENIX Symposium on …, 2014 - usenix.org
The last five years have seen a rise of implementation-level distributed system model
checkers (dmck) for verifying the reliability of real distributed systems. Existing dmcks …

Managed communication and consistency for fast data-parallel iterative analytics

J Wei, W Dai, A Qiao, Q Ho, H Cui, GR Ganger… - Proceedings of the …, 2015 - dl.acm.org
At the core of Machine Learning (ML) analytics is often an expert-suggested model, whose
parameters are refined by iteratively processing a training dataset until convergence. The …

Optimizing load balancing and data-locality with data-aware scheduling

K Wang, X Zhou, T Li, D Zhao, M Lang… - … Conference on Big …, 2014 - ieeexplore.ieee.org
Load balancing techniques (e.g., work stealing) are important to obtain the best performance
for distributed task scheduling systems that have multiple schedulers making scheduling …
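A minimal sketch of the data-aware stealing idea this entry describes: an idle scheduler steals from a loaded peer, preferring tasks whose input data is already local to it, and falling back to any task to keep load balanced. Names and structure are illustrative, not the paper's implementation.

```python
# Sketch of data-aware work stealing: prefer a task whose input data is
# local to the thief; otherwise steal any task for load balance.
# (Illustrative model only.)

def steal(victim_tasks, thief_node, data_location):
    """Remove and return one task from the victim's queue."""
    for i, task in enumerate(victim_tasks):
        if data_location.get(task) == thief_node:
            return victim_tasks.pop(i)  # locality-aware choice
    # No data-local task: steal the last one anyway.
    return victim_tasks.pop() if victim_tasks else None

victim = ["t1", "t2", "t3"]
locs = {"t1": "nodeA", "t2": "nodeB", "t3": "nodeA"}
print(steal(victim, "nodeB", locs))  # t2 (its data lives on the thief)
print(steal(victim, "nodeC", locs))  # t3 (fallback: no local task)
```

The tension the paper studies is visible even here: always preferring local tasks can leave load imbalanced, while always stealing blindly wastes data movement.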