The design and operation of CloudLab

D Duplyakin, R Ricci, A Maricq, G Wong… - 2019 USENIX annual …, 2019 - usenix.org
Given the highly empirical nature of research in cloud computing, networked systems, and
related fields, testbeds play an important role in the research ecosystem. In this paper, we …

GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server

H Cui, H Zhang, GR Ganger, PB Gibbons… - Proceedings of the …, 2016 - dl.acm.org
Large-scale deep learning requires huge computational resources to train a multi-layer
neural network. Recent systems propose using 100s to 1000s of machines to train networks …
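A minimal sketch of the parameter-server pattern this paper builds on: a central store holds model parameters, and workers push gradient updates and pull current values. All names here are illustrative, not GeePS's actual (GPU-sharded) API.

```python
# Illustrative parameter-server sketch; GeePS itself specializes this
# pattern for GPU memory, which this toy version does not model.
import threading

class ParameterServer:
    """Central parameter store: workers push gradients, pull values."""
    def __init__(self, params):
        self._params = dict(params)
        self._lock = threading.Lock()

    def pull(self, key):
        with self._lock:
            return self._params[key]

    def push(self, key, gradient, lr=0.1):
        # Apply one SGD step atomically.
        with self._lock:
            self._params[key] -= lr * gradient

def worker(ps, key, grads):
    for g in grads:
        ps.push(key, g)

ps = ParameterServer({"w": 1.0})
threads = [threading.Thread(target=worker, args=(ps, "w", [0.5] * 4))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(ps.pull("w"))  # ~0.6 after 8 pushes of gradient 0.5 at lr 0.1
```

Real systems shard the key space across many server processes; the single-process lock stands in for that coordination here.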

TetriSched: global rescheduling with adaptive plan-ahead in dynamic heterogeneous clusters

A Tumanov, T Zhu, JW Park, MA Kozuch… - Proceedings of the …, 2016 - dl.acm.org
TetriSched is a scheduler that works in tandem with a calendaring reservation system to
continuously re-evaluate the immediate-term scheduling plan for all pending jobs (including …

IndexFS: Scaling file system metadata performance with stateless caching and bulk insertion

K Ren, Q Zheng, S Patil… - SC'14: Proceedings of the …, 2014 - ieeexplore.ieee.org
The growing size of modern storage systems is expected to exceed billions of objects,
making metadata scalability critical to overall performance. Many existing distributed file …

Exploiting bounded staleness to speed up big data analytics

H Cui, J Cipar, Q Ho, JK Kim, S Lee, A Kumar… - 2014 USENIX Annual …, 2014 - usenix.org
Many modern machine learning (ML) algorithms are iterative, converging on a final solution
via many iterations over the input data. This paper explores approaches to exploiting these …
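The bounded-staleness idea the snippet refers to can be stated as a one-line condition (the Stale Synchronous Parallel model): a worker may start its next iteration only if the slowest worker is at most a fixed number of iterations behind it. A simplified, single-process sketch (function and variable names are my own):

```python
# Sketch of the Stale Synchronous Parallel (SSP) progress condition:
# fresher workers may run ahead of the slowest one by at most `staleness`
# iterations, trading update freshness for less synchronization.

def can_proceed(worker_clock, all_clocks, staleness):
    """True if a worker at `worker_clock` may start its next iteration."""
    return worker_clock - min(all_clocks) <= staleness

clocks = [5, 3, 4]  # per-worker iteration counters
print(can_proceed(5, clocks, staleness=2))  # True: 5 - 3 <= 2
print(can_proceed(5, clocks, staleness=1))  # False: 5 - 3 > 1
```

Setting `staleness=0` recovers BSP (a full barrier every iteration); larger values let fast workers keep computing on slightly stale parameters.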

Addressing the straggler problem for iterative convergent parallel ML

A Harlap, H Cui, W Dai, J Wei, GR Ganger… - Proceedings of the …, 2016 - dl.acm.org
FlexRR provides a scalable, efficient solution to the straggler problem for iterative machine
learning (ML). The frequent (e.g., per-iteration) barriers used in traditional BSP-based …

On model parallelization and scheduling strategies for distributed machine learning

S Lee, JK Kim, X Zheng, Q Ho… - Advances in neural …, 2014 - proceedings.neurips.cc
Distributed machine learning has typically been approached from a data parallel
perspective, where big data are partitioned to multiple workers and an algorithm is executed …
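The contrast the snippet draws can be made concrete: data parallelism splits the training data across workers, while model parallelism splits the parameters themselves, with each worker updating only its own shard. A toy round-robin partitioner (illustrative only; the paper's STRADS scheduler chooses shards far more carefully):

```python
# Sketch of model-parallel partitioning: parameters, not data, are split
# across workers. Round-robin assignment is a naive stand-in for the
# paper's dependency-aware scheduling.

def partition_params(param_names, n_workers):
    """Round-robin assignment of model parameters to worker shards."""
    shards = [[] for _ in range(n_workers)]
    for i, name in enumerate(param_names):
        shards[i % n_workers].append(name)
    return shards

print(partition_params(["w0", "w1", "w2", "w3", "w4"], 2))
# [['w0', 'w2', 'w4'], ['w1', 'w3']]
```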

SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems

T Leesatapornwongsa, M Hao, P Joshi… - … USENIX Symposium on …, 2014 - usenix.org
The last five years have seen a rise of implementation-level distributed system model
checkers (dmck) for verifying the reliability of real distributed systems. Existing dmcks …

Managed communication and consistency for fast data-parallel iterative analytics

J Wei, W Dai, A Qiao, Q Ho, H Cui, GR Ganger… - Proceedings of the …, 2015 - dl.acm.org
At the core of Machine Learning (ML) analytics is often an expert-suggested model, whose
parameters are refined by iteratively processing a training dataset until convergence. The …

Optimizing load balancing and data-locality with data-aware scheduling

K Wang, X Zhou, T Li, D Zhao, M Lang… - … Conference on Big …, 2014 - ieeexplore.ieee.org
Load balancing techniques (e.g., work stealing) are important to obtain the best performance
for distributed task scheduling systems that have multiple schedulers making scheduling …
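A minimal sketch of the data-aware stealing idea this entry describes: an idle scheduler steals from a loaded peer, preferring tasks whose input data is already local to it, and falling back to any task to keep load balanced. Names and structure are illustrative, not the paper's implementation.

```python
# Sketch of data-aware work stealing: prefer a task whose input data is
# local to the thief; otherwise steal any task for load balance.
# (Illustrative model only.)

def steal(victim_tasks, thief_node, data_location):
    """Remove and return one task from the victim's queue."""
    for i, task in enumerate(victim_tasks):
        if data_location.get(task) == thief_node:
            return victim_tasks.pop(i)  # locality-aware choice
    # No data-local task: steal the last one anyway.
    return victim_tasks.pop() if victim_tasks else None

victim = ["t1", "t2", "t3"]
locs = {"t1": "nodeA", "t2": "nodeB", "t3": "nodeA"}
print(steal(victim, "nodeB", locs))  # t2 (its data lives on the thief)
print(steal(victim, "nodeC", locs))  # t3 (fallback: no local task)
```

The tension the paper studies is visible even here: always preferring local tasks can leave load imbalanced, while always stealing blindly wastes data movement.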