RL-Watchdog: A Fast and Predictable SSD Liveness Watchdog on Storage Systems

JY Ha, S Lee, HY Yeom, Y Son - 2024 USENIX Annual Technical …, 2024 - usenix.org
This paper proposes a reinforcement learning-based watchdog (RLW) that examines solid-state
drive (SSD) liveness or failures by faults (e.g., controller/power faults and high …

Design and evaluation of a risk-aware failure identification scheme for improved RAS in erasure-coded data centers

W Huang, J Fang, S Wan, C Xie… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Data reliability, availability, and serviceability (RAS) of erasure-coded data centers are
highly affected by data repair induced by node failures. In a traditional failure identification …

[BOOK][B] Towards Scale-Checkable Systems

CAS Moraga - 2022 - search.proquest.com
In this document, we present our approaches for understanding and discovering scalability
faults, i.e., faults whose symptoms appear at larger scales but are not visible at smaller scales …

[BOOK][B] Automatically Fixing Performance Bugs and Extracting Bug Signatures for Cloud Systems

J He - 2021 - search.proquest.com
Cloud systems are becoming increasingly complex and performance bugs are inevitable.
Performance bugs are notoriously difficult to debug and fix due to lack of diagnostic …

[PDF][PDF] SCALEVIEW: Identifying and Analyzing Potential Scalability Faults in Large-Scale Distributed Systems Draft–Private View Only

CA Stuardo, HN Zhu, PJ Chapman, C Rubio-Gonzalez… - people.cs.uchicago.edu
We present SCALEVIEW, a framework for identifying and analyzing potential scalability
faults in large-scale distributed systems. SCALEVIEW combines instrumentation and …

TFix+: Self-configuring Hybrid Timeout Bug Fixing for Cloud Systems

J He, T Dai, X Gu - arXiv preprint arXiv:2110.04101, 2021 - arxiv.org
Timeout bugs can cause serious availability and performance issues which are often difficult
to fix due to the lack of diagnostic information. Previous work proposed solutions for fixing …

Efficient data and metadata processing in large-scale distributed systems

R Shi - 2018 - rave.ohiolink.edu
Research on large-scale systems is challenging because deploying a large system requires a
great amount of resources. My approach to address this problem is based on the …

Packet Processing in Hardware IRQ Handlers for Tail Latency Reduction

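菊地隆文, 名取廣, 河野健二 - Research Reports on System Software and …, 2020 - ipsj.ixsq.nii.ac.jp
Abstract: Modern information services are built on distributed systems. When a large-scale
failure occurs in a distributed system, it leads to a service outage. Therefore, improving the reliability of distributed systems …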