We present a comprehensive study of 136 system failures attributed to network-partitioning faults from 25 widely used distributed systems. We found that the majority of the failures led …
In large-scale distributed systems, node crashes are inevitable, and can happen at any time. As such, distributed systems are usually designed to be resilient to these node crashes via …
To deal with large amounts of data while offering high availability, throughput, and low latency, cloud computing systems rely on distributed, partitioned, and replicated data stores …
Distributed systems are difficult to debug and understand. A key reason for this is distributed state, which is not easily accessible and must be pieced together from the states of the …
H Liu, X Wang, G Li, S Lu, F Ye, C Tian - ACM SIGPLAN Notices, 2018 - dl.acm.org
It is crucial for distributed systems to achieve high availability. Unfortunately, this is challenging given the common component failures (i.e., faults). Developers often cannot …
J Lu, F Li, L Li, X Feng - Proceedings of the 2018 26th ACM joint meeting …, 2018 - dl.acm.org
Cloud systems suffer from distributed concurrency bugs, which are notoriously difficult to detect and often lead to data loss and service outage. This paper presents CloudRaid, a …
A real-world distributed system is rarely implemented as a standalone monolithic system. Instead, it is composed of multiple independent interacting components that together ensure …
X Yuan, J Yang, R Gu - … : 30th International Conference, CAV 2018, Held …, 2018 - Springer
We present POS, a concurrency testing approach that samples the partial order of concurrent programs. POS uses a novel priority-based scheduling algorithm that …
P Godefroid, K Sen - Handbook of Model Checking, 2018 - Springer
Model checking and testing have a lot in common. Over the last two decades, significant progress has been made on how to broaden the scope of model checking from …