REPT: Reverse debugging of failures in deployed software

W Cui, X Ge, B Kasikci, B Niu, U Sharma… - … USENIX Symposium on …, 2018 - usenix.org
Debugging software failures in deployed systems is important because they impact real
users and customers. However, debugging such failures is notoriously hard in practice …

An analysis of Network-Partitioning failures in cloud systems

A Alquraan, H Takruri, M Alfatafta… - 13th USENIX Symposium …, 2018 - usenix.org
We present a comprehensive study of 136 system failures attributed to network-partitioning
faults from 25 widely used distributed systems. We found that the majority of the failures led …

An empirical study on crash recovery bugs in large-scale distributed systems

Y Gao, W Dou, F Qin, C Gao, D Wang, J Wei… - Proceedings of the …, 2018 - dl.acm.org
In large-scale distributed systems, node crashes are inevitable, and can happen at any time.
As such, distributed systems are usually designed to be resilient to these node crashes via …

Survivability: design, formal modeling, and validation of cloud storage systems using Maude

R Bobba, J Grov, I Gupta, S Liu… - Assured cloud …, 2018 - books.google.com
To deal with large amounts of data while offering high availability, throughput, and low
latency, cloud computing systems rely on distributed, partitioned, and replicated data stores …

Inferring and asserting distributed system invariants

S Grant, H Cech, I Beschastnikh - Proceedings of the 40th International …, 2018 - dl.acm.org
Distributed systems are difficult to debug and understand. A key reason for this is distributed
state, which is not easily accessible and must be pieced together from the states of the …

FCatch: Automatically detecting time-of-fault bugs in cloud systems

H Liu, X Wang, G Li, S Lu, F Ye, C Tian - ACM SIGPLAN Notices, 2018 - dl.acm.org
It is crucial for distributed systems to achieve high availability. Unfortunately, this is
challenging given the common component failures (i.e., faults). Developers often cannot …

CloudRaid: hunting concurrency bugs in the cloud via log-mining

J Lu, F Li, L Li, X Feng - Proceedings of the 2018 26th ACM joint meeting …, 2018 - dl.acm.org
Cloud systems suffer from distributed concurrency bugs, which are notoriously difficult to
detect and often lead to data loss and service outage. This paper presents CloudRaid, a …

Compositional programming and testing of dynamic distributed systems

A Desai, A Phanishayee, S Qadeer… - Proceedings of the ACM …, 2018 - dl.acm.org
A real-world distributed system is rarely implemented as a standalone monolithic system.
Instead, it is composed of multiple independent interacting components that together ensure …

Partial order aware concurrency sampling

X Yuan, J Yang, R Gu - … : 30th International Conference, CAV 2018, Held …, 2018 - Springer
We present POS, a concurrency testing approach that samples the partial order of
concurrent programs. POS uses a novel priority-based scheduling algorithm that …

Combining model checking and testing

P Godefroid, K Sen - Handbook of Model Checking, 2018 - Springer
Model checking and testing have a lot in common. Over the last two decades,
significant progress has been made on how to broaden the scope of model checking from …