Accurate timeout detection despite arbitrary processing delays

JY Ha, S Lee, HY Yeom, Y Son - 2024 USENIX Annual Technical …, 2024 - usenix.org

This paper proposes a reinforcement learning-based watchdog (RLW) that examines
solid-state drive (SSD) liveness or failures by faults (e.g., controller/power faults and high …

[PDF] ieee.org

Design and evaluation of a risk-aware failure identification scheme for improved RAS in erasure-coded data centers

W Huang, J Fang, S Wan, C Xie… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org

Data reliability, availability, and serviceability (RAS) of erasure-coded data centers are
highly affected by data repair induced by node failures. In a traditional failure identification …

Cited by 4 · Related articles · All 4 versions

[HTML] proquest.com

[BOOK][B] Towards Scale-Checkable Systems

CAS Moraga - 2022 - search.proquest.com

In this document, we present our approaches for understanding and discovering scalability
faults, i.e., faults whose symptoms appear at larger scales but are not visible at smaller scales …

[BOOK][B] Automatically Fixing Performance Bugs and Extracting Bug Signatures for Cloud Systems

J He - 2021 - search.proquest.com

Cloud systems are becoming increasingly complex and performance bugs are inevitable.
Performance bugs are notoriously difficult to debug and fix due to lack of diagnostic …

[PDF][PDF] SCALEVIEW: Identifying and Analyzing Potential Scalability Faults in Large-Scale Distributed Systems Draft–Private View Only

CA Stuardo, HN Zhu, PJ Chapman, C Rubio-Gonzalez… - people.cs.uchicago.edu

We present SCALEVIEW, a framework for identifying and analyzing potential scalability
faults in large-scale distributed systems. SCALEVIEW combines instrumentation and …

[PDF] arxiv.org

TFix+: Self-configuring Hybrid Timeout Bug Fixing for Cloud Systems

J He, T Dai, X Gu - arXiv preprint arXiv:2110.04101, 2021 - arxiv.org

Timeout bugs can cause serious availability and performance issues which are often difficult
to fix due to the lack of diagnostic information. Previous work proposed solutions for fixing …

Cited by 1 · Related articles · All 2 versions

[PDF] ohiolink.edu

Efficient data and metadata processing in large-scale distributed systems

R Shi - 2018 - rave.ohiolink.edu

Research on large-scale systems is challenging because deploying a large system needs a
great amount of resources. My approach to address this problem is based on the …

Packet Processing in Hardware IRQ Handlers for Tail Latency Reduction

菊地隆文, 名取廣, 河野健二 - Research Report: System Software and …, 2020 - ipsj.ixsq.nii.ac.jp

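Abstract: Modern information services are built on distributed systems. When a large-scale failure
occurs in a distributed system, it leads to a service outage. Therefore, improving the reliability of distributed systems …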
