[PDF][PDF] Clonos: Consistent High-Availability for Distributed Stream Processing through Causal Logging

PMF Silvestre - 2020 - asc.di.fct.unl.pt
2020asc.di.fct.unl.pt
Nowadays, distributed stream processing systems lie in the backbone of businesses, as a
backend for critical event-driven applications such as real-time fraud detection or stock
trading. Given their critical nature these systems should be expressive, performant, highly-
available and maintain state consistency after failure. However, current fault-tolerance
solutions forego one of these four requirements. Highly-available systems sacrifice either
consistency or expressiveness and often performance, while more reliable systems have …
Abstract
Nowadays, distributed stream processing systems lie in the backbone of businesses, as a backend for critical event-driven applications such as real-time fraud detection or stock trading. Given their critical nature these systems should be expressive, performant, highly-available and maintain state consistency after failure. However, current fault-tolerance solutions forego one of these four requirements. Highly-available systems sacrifice either consistency or expressiveness and often performance, while more reliable systems have slow and blocking recovery.
In this thesis, we describe Clonos, a highly-available stream processing system that instantly switches the execution of failed operators to passive standbys. Our approach is non-blocking as it uses localized recovery, treating only the failed operators. By additionally forcing recovering operators to replay nondeterministic events Clonos achieves consistent recovery. To manage nondeterminism, we adapt causal logging, a rollback recovery method that logs such events in-memory and propagates them causally, to the stream processing paradigm. Clonos is configurable, allowing one to trade-off overhead for safety. To implement Clonos we re-engineered the distributed runtime of Apache Flink, a state-of-the-art stream processing system. To evaluate the performance of Clonos in terms of throughput, latency, network bandwidth and recovery time we perform overhead and failure experiments using both realistic and synthetic workloads. Clonos delivers upwards of 10 times faster recovery times without blocking and with much lower latency, at the cost of 11% throughput overhead on realistic workloads, when compared to state-of-the-art reliable systems. Clonos is more expressive than past highly-available systems, supporting a much larger set of use-cases. Clonos’ use of causal logging also opens a plethora of new opportunities, such as transaction-less exactly-once delivery guarantees and consistent non-blocking reconfiguration.
asc.di.fct.unl.pt
以上显示的是最相近的搜索结果。 查看全部搜索结果

Google学术搜索按钮

example.edu/paper.pdf
搜索
获取 PDF 文件
引用
References