WiscKey: Separating keys from values in SSD-conscious storage

L Lu, TS Pillai, H Gopalakrishnan… - ACM Transactions On …, 2017 - dl.acm.org
We present WiscKey, a persistent LSM-tree-based key-value store with a performance-
oriented data layout that separates keys from values to minimize I/O amplification. The …

Fail-slow at scale: Evidence of hardware performance faults in large production systems

HS Gunawi, RO Suminto, R Sears, C Golliher… - ACM Transactions on …, 2018 - dl.acm.org
Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of
fail-slow hardware incidents, collected from large-scale cluster deployments in 14 …

Perseus: A Fail-Slow detection framework for cloud storage systems

R Lu, E Xu, Y Zhang, F Zhu, Z Zhu, M Wang… - … USENIX Conference on …, 2023 - usenix.org
The newly-emerging "fail-slow" failures plague both software and hardware where the victim
components are still functioning yet with degraded performance. To address this problem …

Redundancy does not imply fault tolerance: Analysis of distributed storage reactions to file-system faults

A Ganesan, R Alagappan, AC Arpaci-Dusseau… - ACM Transactions on …, 2017 - dl.acm.org
We analyze how modern distributed storage systems behave in the presence of file-system
faults such as data corruption and read and write errors. We characterize eight popular …

What bugs cause production cloud incidents?

H Liu, S Lu, M Musuvathi, S Nath - Proceedings of the Workshop on Hot …, 2019 - dl.acm.org
Cloud services have become the backbone of today's computing world. Runtime incidents,
which adversely affect the expected service operations, are extremely costly in terms of user …

If At First You Don't Succeed, Try, Try, Again...? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems

BA Stoica, U Sethi, Y Su, C Zhou, S Lu, J Mace… - Proceedings of the …, 2024 - dl.acm.org
Retry, the re-execution of a task on failure, is a common mechanism to enable resilient
software systems. Yet, despite its commonality and long history, retry remains difficult to …

An empirical study on crash recovery bugs in large-scale distributed systems

Y Gao, W Dou, F Qin, C Gao, D Wang, J Wei… - Proceedings of the …, 2018 - dl.acm.org
In large-scale distributed systems, node crashes are inevitable, and can happen at any time.
As such, distributed systems are usually designed to be resilient to these node crashes via …

Consistency issue and related trade-offs in distributed replicated systems and databases: a review

J Ahmed, A Karpenko, O Tarasyuk, A Gorbenko… - 2023 - dspace.library.khai.edu
UDC: 621. Information security and safety. UDC 004.75. doi: 10.32620/reks.2023.2.14
Jaafar AHMED, Andrii KARPENKO, Olga TARASYUK, Anatoliy GORBENKO, Akbar …

Automatic reliability testing for cluster management controllers

X Sun, W Luo, JT Gu, A Ganesan… - … USENIX Symposium on …, 2022 - usenix.org
Modern cluster managers like Borg, Omega and Kubernetes rely on the state-reconciliation
principle to be highly resilient and extensible. In these systems, all cluster-management …

An In-Depth Study of Correlated Failures in Production SSD-Based Data Centers

S Han, PPC Lee, F Xu, Y Liu, C He, J Liu - 19th USENIX Conference on …, 2021 - usenix.org
Flash-based solid-state drives (SSDs) are increasingly adopted as the mainstream storage
media in modern data centers. However, little is known about how SSD failures in the field …