NOVA-Fortis: A fault-tolerant non-volatile main memory file system

J Xu, L Zhang, A Memaripour… - Proceedings of the 26th …, 2017 - dl.acm.org
Emerging fast, persistent memories will enable systems that combine conventional DRAM
with large amounts of non-volatile main memory (NVMM) and provide huge increases in …

Fail-slow at scale: Evidence of hardware performance faults in large production systems

HS Gunawi, RO Suminto, R Sears, C Golliher… - ACM Transactions on …, 2018 - dl.acm.org
Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of
fail-slow hardware incidents, collected from large-scale cluster deployments in 14 …

What bugs live in the cloud? A study of 3000+ issues in cloud systems

HS Gunawi, M Hao, T Leesatapornwongsa… - Proceedings of the …, 2014 - dl.acm.org
We conduct a comprehensive study of development and deployment issues of six popular
and important cloud systems (Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper …

SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems

T Leesatapornwongsa, M Hao, P Joshi… - … USENIX Symposium on …, 2014 - usenix.org
The last five years have seen a rise of implementation-level distributed system model
checkers (dmck) for verifying the reliability of real distributed systems. Existing dmcks …

Who's afraid of uncorrectable bit errors? Online recovery of flash errors with distributed redundancy

A Tai, A Kryczka, SO Kanaujia, K Jamieson… - 2019 USENIX Annual …, 2019 - usenix.org
Due to its high performance and decreasing cost per bit, flash storage is the main storage
medium in datacenters for hot data. However, flash endurance is a perpetual problem, and …

Protocol-aware recovery for consensus-based distributed storage

R Alagappan, A Ganesan, E Lee… - ACM Transactions on …, 2018 - dl.acm.org
We introduce protocol-aware recovery (PAR), a new approach that exploits protocol-specific
knowledge to correctly recover from storage faults in distributed systems. We demonstrate …

Understanding issue correlations: A case study of the Hadoop system

J Huang, X Zhang, K Schwan - … of the Sixth ACM Symposium on Cloud …, 2015 - dl.acm.org
Over the last decade, Hadoop has evolved into a widely used platform for Big Data
applications. Acknowledging its wide-spread use, we present a comprehensive analysis of …

Checking the integrity of transactional mechanisms

D Fryer, M Qin, J Sun, KW Lee, AD Brown… - ACM Transactions on …, 2014 - dl.acm.org
Data corruption is the most common consequence of file-system bugs. When such
corruption occurs, offline check and recovery tools must be used, but they are error prone …

Optimizing Hadoop framework for solid state drives

J Hong, L Li, C Han, B Jin, Q Yang… - 2016 IEEE International …, 2016 - ieeexplore.ieee.org
Solid state drives (SSDs) have been widely used in Hadoop clusters ever since their
introduction to the big data industry. However, the current Hadoop framework is not …

Adaptive metric nearest neighbor classification

C Domeniconi, J Peng… - … IEEE Conference on …, 2000 - ieeexplore.ieee.org
Nearest neighbor classification assumes locally constant class conditional probabilities. This
assumption becomes invalid in high dimensions with finite samples due to the curse of …