Gremlin: Systematic resilience testing of microservices

V Heorhiadi, S Rajagopalan, H Jamjoom… - 2016 IEEE 36th …, 2016 - ieeexplore.ieee.org
Modern Internet applications are being disaggregated into a microservice-based
architecture, with services being updated and deployed hundreds of times a day. The …

SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems

T Leesatapornwongsa, M Hao, P Joshi… - … USENIX Symposium on …, 2014 - usenix.org
The last five years have seen a rise of implementation-level distributed system model
checkers (dmck) for verifying the reliability of real distributed systems. Existing dmcks …

An empirical study on crash recovery bugs in large-scale distributed systems

Y Gao, W Dou, F Qin, C Gao, D Wang, J Wei… - Proceedings of the …, 2018 - dl.acm.org
In large-scale distributed systems, node crashes are inevitable, and can happen at any time.
As such, distributed systems are usually designed to be resilient to these node crashes via …

FlyMC: Highly scalable testing of complex interleavings in distributed systems

JF Lukman, H Ke, CA Stuardo, RO Suminto… - Proceedings of the …, 2019 - dl.acm.org
We present a fast and scalable testing approach for datacenter/cloud systems such as
Cassandra, Hadoop, Spark, and ZooKeeper. The uniqueness of our approach is in its ability …

Service-level fault injection testing

CS Meiklejohn, A Estrada, Y Song, H Miller… - Proceedings of the …, 2021 - dl.acm.org
Companies today increasingly rely on microservice architectures to deliver service for their
large-scale mobile or web applications. However, not all developers working on these …

CrashTuner: Detecting crash-recovery bugs in cloud systems via meta-info analysis

J Lu, C Liu, L Li, X Feng, F Tan, J Yang… - Proceedings of the 27th …, 2019 - dl.acm.org
Crash-recovery bugs (bugs in crash-recovery-related mechanisms) are among the most
severe bugs in cloud systems and can easily cause system failures. It is notoriously difficult …

A study of failure recovery and logging of high-performance parallel file systems

R Han, OR Gatla, M Zheng, J Cao, D Zhang… - ACM Transactions on …, 2022 - dl.acm.org
Large-scale parallel file systems (PFSs) play an essential role in high-performance
computing (HPC). However, despite their importance, their reliability is much less studied or …

FCatch: Automatically detecting time-of-fault bugs in cloud systems

H Liu, X Wang, G Li, S Lu, F Ye, C Tian - ACM SIGPLAN Notices, 2018 - dl.acm.org
It is crucial for distributed systems to achieve high availability. Unfortunately, this is
challenging given the common component failures (i.e., faults). Developers often cannot …

CloudRaid: hunting concurrency bugs in the cloud via log-mining

J Lu, F Li, L Li, X Feng - Proceedings of the 2018 26th ACM joint meeting …, 2018 - dl.acm.org
Cloud systems suffer from distributed concurrency bugs, which are notoriously difficult to
detect and often lead to data loss and service outage. This paper presents CloudRaid, a …

Switching Gaussian process dynamic models for simultaneous composite motion tracking and recognition

J Chen, M Kim, Y Wang, Q Ji - 2009 IEEE Conference on …, 2009 - ieeexplore.ieee.org
Traditional dynamical systems used for motion tracking cannot effectively handle high
dimensionality of the motion states and composite dynamics. In this paper, to address both …