Retro: Targeted resource management in multi-tenant distributed systems

J Mace, P Bodik, R Fonseca, M Musuvathi - 12th USENIX Symposium …, 2015 - usenix.org
In distributed systems shared by multiple tenants, effective resource management is an
important pre-requisite to providing quality of service guarantees. Many systems deployed …

Metastable failures in the wild

L Huang, M Magnusson, AB Muralikrishna… - … USENIX Symposium on …, 2022 - usenix.org
Recently, Bronson et al. introduced a framework for understanding a class of failures in
distributed systems called metastable failures. The examples of metastable failures …

Limplock: Understanding the impact of limpware on scale-out cloud systems

T Do, M Hao, T Leesatapornwongsa… - Proceedings of the 4th …, 2013 - dl.acm.org
We highlight one often-overlooked cause of performance failure: limpware -- "limping"
hardware whose performance degrades significantly compared to its specification. We …

FTCloudSim: a simulation tool for cloud service reliability enhancement mechanisms

A Zhou, S Wang, Q Sun, H Zou, F Yang - … Demo & Poster Track of ACM …, 2013 - dl.acm.org
Recently, an increasing number of companies have deployed their application services in
the cloud. However, the cloud data center downtime has negatively affected the quality of …

FTCloudSim: support for cloud service reliability enhancement simulation

A Zhou, S Wang, C Yang, L Sun… - … Journal of Web and …, 2015 - inderscienceonline.com
Recently, an increasing number of companies have begun to deploy their application
services in the cloud. However, the cloud data centre downtime has negatively affected the …

Monitoring Performance in Large Scale Computing Clouds with Passive Benchmarking

C Nieke, WT Balke - 2017 IEEE 10th International Conference …, 2017 - ieeexplore.ieee.org
Providers of computing services such as data science clouds need to maintain large
hardware infrastructures often with thousands of nodes. Using commodity hardware leads to …

Efficient Online Processing for Advanced Analytics

MEMA El Seidy - 2017 - infoscience.epfl.ch
With the advent of emerging technologies and the Internet of Things, the importance of
online data analytics has become more pronounced. Businesses and companies are …

Impact of Limpware on HDFS: A Probabilistic Estimation

T Do, HS Gunawi - arXiv preprint arXiv:1311.3322, 2013 - arxiv.org
With the advent of cloud computing, thousands of machines are connected and managed
collectively. This era is confronted with a new challenge: performance variability, primarily …

Squall: Scalable Real-time Analytics using Efficient, Skew-resilient Join Operators

A Vitorović - 2023 - infoscience.epfl.ch
Squall is a scalable online query engine that runs complex analytics in a cluster using skew-
resilient, adaptive operators. Online processing implies that results are incrementally built as …

Towards Reliable Cloud Systems

TD Do - 2014 - search.proquest.com
Although providing tremendous access to data and computing power of thousands of
commodity servers, large-scale cloud systems must address a new challenge: they must …