Watermarks in stream processing systems: semantics and comparative analysis of Apache Flink and Google cloud dataflow

Streaming data processing is an exercise in taming disorder: from oftentimes huge torrents of information, we hope to extract powerful and timely analyses. But when dealing with streaming data, the unbounded and temporally disordered nature of real-world streams introduces a critical challenge: how...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Proceedings of the VLDB Endowment 2021-09, Vol.14 (12), p.3135-3147
Hauptverfasser: Akidau, Tyler, Begoli, Edmon, Chernyak, Slava, Hueske, Fabian, Knight, Kathryn, Knowles, Kenneth, Mills, Daniel, Sotolongo, Dan
Format: Artikel
Sprache:eng
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Streaming data processing is an exercise in taming disorder: from oftentimes huge torrents of information, we hope to extract powerful and timely analyses. But when dealing with streaming data, the unbounded and temporally disordered nature of real-world streams introduces a critical challenge: how does one reason about the completeness of a stream that never ends? In this paper, we present a comprehensive definition and analysis of watermarks , a key tool for reasoning about temporal completeness in infinite streams. First, we describe what watermarks are and why they are important, highlighting how they address a suite of stream processing needs that are poorly served by eventually-consistent approaches: • Computing a single correct answer, as in notifications. • Reasoning about a lack of data, as in dip detection. • Performing non-incremental processing over temporal subsets of an infinite stream, as in statistical anomaly detection with cubic spline models. • Safely and punctually garbage collecting obsolete inputs and intermediate state. • Surfacing a reliable signal of overall pipeline health . Second, we describe, evaluate, and compare the semantically equivalent, but starkly different, watermark implementations in two modern stream processing engines: Apache Flink and Google Cloud Dataflow.
ISSN:2150-8097
2150-8097
DOI:10.14778/3476311.3476389