CheckMate: Evaluating Checkpointing Protocols for Streaming Dataflows
Main authors: | , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Abstract: | Stream processing in the last decade has seen broad adoption in both
commercial and research settings. One key element for this success is the
ability of modern stream processors to handle failures while ensuring
exactly-once processing guarantees. At the moment of writing, virtually all
stream processors that guarantee exactly-once processing implement a variant of
Apache Flink's coordinated checkpoints, an extension of the original
Chandy-Lamport checkpoints from 1985. However, the reasons behind this
prevalence of the coordinated approach remain anecdotal, as reported by
practitioners of the stream processing community. At the same time, common
checkpointing approaches, such as the uncoordinated and the
communication-induced ones, remain largely unexplored.
This paper is the first to address this gap by i) shedding light on why
practitioners have favored the coordinated approach and ii) investigating
whether there are viable alternatives. To this end, we implement three
checkpointing approaches that we surveyed and adapted for the distinct needs of
streaming dataflows. Our analysis shows that the coordinated approach
outperforms the uncoordinated and communication-induced protocols under
uniformly distributed workloads. To our surprise, however, the uncoordinated
approach is not only competitive with the coordinated one in uniformly
distributed workloads, but it also outperforms the coordinated approach in
skewed workloads. We conclude that rather than blindly employing coordinated
checkpointing, research should focus on optimizing the very promising
uncoordinated approach, as it can address issues with skew and support
prevalent cyclic queries. We believe that our findings can trigger further
research into checkpointing mechanisms. |
---|---|
DOI: | 10.48550/arxiv.2403.13629 |