Watermarks in stream processing systems: semantics and comparative analysis of Apache Flink and Google Cloud Dataflow

Streaming data processing is an exercise in taming disorder: from oftentimes huge torrents of information, we hope to extract powerful and timely analyses. But when dealing with streaming data, the unbounded and temporally disordered nature of real-world streams introduces a critical challenge: how...

Detailed description

Bibliographic details
Published in: Proceedings of the VLDB Endowment 2021-09, Vol.14 (12), p.3135-3147
Main authors: Akidau, Tyler; Begoli, Edmon; Chernyak, Slava; Hueske, Fabian; Knight, Kathryn; Knowles, Kenneth; Mills, Daniel; Sotolongo, Dan
Format: Article
Language: English
Online access: Full text
container_end_page 3147
container_issue 12
container_start_page 3135
container_title Proceedings of the VLDB Endowment
container_volume 14
creator Akidau, Tyler
Begoli, Edmon
Chernyak, Slava
Hueske, Fabian
Knight, Kathryn
Knowles, Kenneth
Mills, Daniel
Sotolongo, Dan
description Streaming data processing is an exercise in taming disorder: from oftentimes huge torrents of information, we hope to extract powerful and timely analyses. But when dealing with streaming data, the unbounded and temporally disordered nature of real-world streams introduces a critical challenge: how does one reason about the completeness of a stream that never ends? In this paper, we present a comprehensive definition and analysis of watermarks, a key tool for reasoning about temporal completeness in infinite streams. First, we describe what watermarks are and why they are important, highlighting how they address a suite of stream processing needs that are poorly served by eventually-consistent approaches: • Computing a single correct answer, as in notifications. • Reasoning about a lack of data, as in dip detection. • Performing non-incremental processing over temporal subsets of an infinite stream, as in statistical anomaly detection with cubic spline models. • Safely and punctually garbage collecting obsolete inputs and intermediate state. • Surfacing a reliable signal of overall pipeline health. Second, we describe, evaluate, and compare the semantically equivalent, but starkly different, watermark implementations in two modern stream processing engines: Apache Flink and Google Cloud Dataflow.
doi_str_mv 10.14778/3476311.3476389
format Article
fullrecord <record><control><sourceid>crossref_osti_</sourceid><recordid>TN_cdi_osti_scitechconnect_1823361</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>10_14778_3476311_3476389</sourcerecordid><originalsourceid>FETCH-LOGICAL-c223t-287b39379e11595d5d6b8aae9222c6463f7f3160a7f59ce17c5f646d9f00a3ec3</originalsourceid><addsrcrecordid>eNpNkL1PwzAUxC0EEqWwM0YsTCl-fvHXiCq-pEosIEbLdWwIkKTy89L_nqjNwHSn0-mk-zF2DXwFjdbmDhutEGB1UGNP2EKA5LXhVp_-8-fsguibc2UUmAW7_fAl5t7nH6q6oaKSo--rXR5DJOqGz4r2VGJPl-ws-V-KV7Mu2fvjw9v6ud68Pr2s7zd1EAJLLYzeokVtI4C0spWt2hrvoxVCBNUoTDohKO51kjZE0EGmKW5t4txjDLhkN8fdkUrnKHQlhq8wDkMMxYERiAqmEj-WQh6Jckxul7vpw94BdwcabqbhZhr4B1Q2UJ0</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Watermarks in stream processing systems: semantics and comparative analysis of Apache Flink and Google cloud dataflow</title><source>ACM Digital Library Complete</source><creator>Akidau, Tyler ; Begoli, Edmon ; Chernyak, Slava ; Hueske, Fabian ; Knight, Kathryn ; Knowles, Kenneth ; Mills, Daniel ; Sotolongo, Dan</creator><creatorcontrib>Akidau, Tyler ; Begoli, Edmon ; Chernyak, Slava ; Hueske, Fabian ; Knight, Kathryn ; Knowles, Kenneth ; Mills, Daniel ; Sotolongo, Dan ; Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)</creatorcontrib><description>Streaming data processing is an exercise in taming disorder: from oftentimes huge torrents of information, we hope to extract powerful and timely analyses. But when dealing with streaming data, the unbounded and temporally disordered nature of real-world streams introduces a critical challenge: how does one reason about the completeness of a stream that never ends? In this paper, we present a comprehensive definition and analysis of watermarks , a key tool for reasoning about temporal completeness in infinite streams. 
First, we describe what watermarks are and why they are important, highlighting how they address a suite of stream processing needs that are poorly served by eventually-consistent approaches: • Computing a single correct answer, as in notifications. • Reasoning about a lack of data, as in dip detection. • Performing non-incremental processing over temporal subsets of an infinite stream, as in statistical anomaly detection with cubic spline models. • Safely and punctually garbage collecting obsolete inputs and intermediate state. • Surfacing a reliable signal of overall pipeline health . Second, we describe, evaluate, and compare the semantically equivalent, but starkly different, watermark implementations in two modern stream processing engines: Apache Flink and Google Cloud Dataflow.</description><identifier>ISSN: 2150-8097</identifier><identifier>EISSN: 2150-8097</identifier><identifier>DOI: 10.14778/3476311.3476389</identifier><language>eng</language><publisher>United States</publisher><ispartof>Proceedings of the VLDB Endowment, 2021-09, Vol.14 (12), p.3135-3147</ispartof><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c223t-287b39379e11595d5d6b8aae9222c6463f7f3160a7f59ce17c5f646d9f00a3ec3</cites><orcidid>0000000221733663</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>230,309,314,780,784,789,885,23928,27922,27923</link.rule.ids><backlink>$$Uhttps://www.osti.gov/servlets/purl/1823361$$D View this record in Osti.gov$$Hfree_for_read</backlink></links><search><creatorcontrib>Akidau, Tyler</creatorcontrib><creatorcontrib>Begoli, Edmon</creatorcontrib><creatorcontrib>Chernyak, Slava</creatorcontrib><creatorcontrib>Hueske, Fabian</creatorcontrib><creatorcontrib>Knight, Kathryn</creatorcontrib><creatorcontrib>Knowles, 
Kenneth</creatorcontrib><creatorcontrib>Mills, Daniel</creatorcontrib><creatorcontrib>Sotolongo, Dan</creatorcontrib><creatorcontrib>Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)</creatorcontrib><title>Watermarks in stream processing systems: semantics and comparative analysis of Apache Flink and Google cloud dataflow</title><title>Proceedings of the VLDB Endowment</title><description>Streaming data processing is an exercise in taming disorder: from oftentimes huge torrents of information, we hope to extract powerful and timely analyses. But when dealing with streaming data, the unbounded and temporally disordered nature of real-world streams introduces a critical challenge: how does one reason about the completeness of a stream that never ends? In this paper, we present a comprehensive definition and analysis of watermarks , a key tool for reasoning about temporal completeness in infinite streams. First, we describe what watermarks are and why they are important, highlighting how they address a suite of stream processing needs that are poorly served by eventually-consistent approaches: • Computing a single correct answer, as in notifications. • Reasoning about a lack of data, as in dip detection. • Performing non-incremental processing over temporal subsets of an infinite stream, as in statistical anomaly detection with cubic spline models. • Safely and punctually garbage collecting obsolete inputs and intermediate state. • Surfacing a reliable signal of overall pipeline health . 
Second, we describe, evaluate, and compare the semantically equivalent, but starkly different, watermark implementations in two modern stream processing engines: Apache Flink and Google Cloud Dataflow.</description><issn>2150-8097</issn><issn>2150-8097</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><recordid>eNpNkL1PwzAUxC0EEqWwM0YsTCl-fvHXiCq-pEosIEbLdWwIkKTy89L_nqjNwHSn0-mk-zF2DXwFjdbmDhutEGB1UGNP2EKA5LXhVp_-8-fsguibc2UUmAW7_fAl5t7nH6q6oaKSo--rXR5DJOqGz4r2VGJPl-ws-V-KV7Mu2fvjw9v6ud68Pr2s7zd1EAJLLYzeokVtI4C0spWt2hrvoxVCBNUoTDohKO51kjZE0EGmKW5t4txjDLhkN8fdkUrnKHQlhq8wDkMMxYERiAqmEj-WQh6Jckxul7vpw94BdwcabqbhZhr4B1Q2UJ0</recordid><startdate>20210901</startdate><enddate>20210901</enddate><creator>Akidau, Tyler</creator><creator>Begoli, Edmon</creator><creator>Chernyak, Slava</creator><creator>Hueske, Fabian</creator><creator>Knight, Kathryn</creator><creator>Knowles, Kenneth</creator><creator>Mills, Daniel</creator><creator>Sotolongo, Dan</creator><scope>AAYXX</scope><scope>CITATION</scope><scope>OIOZB</scope><scope>OTOTI</scope><orcidid>https://orcid.org/0000000221733663</orcidid></search><sort><creationdate>20210901</creationdate><title>Watermarks in stream processing systems</title><author>Akidau, Tyler ; Begoli, Edmon ; Chernyak, Slava ; Hueske, Fabian ; Knight, Kathryn ; Knowles, Kenneth ; Mills, Daniel ; Sotolongo, Dan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c223t-287b39379e11595d5d6b8aae9222c6463f7f3160a7f59ce17c5f646d9f00a3ec3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Akidau, Tyler</creatorcontrib><creatorcontrib>Begoli, Edmon</creatorcontrib><creatorcontrib>Chernyak, Slava</creatorcontrib><creatorcontrib>Hueske, Fabian</creatorcontrib><creatorcontrib>Knight, 
Kathryn</creatorcontrib><creatorcontrib>Knowles, Kenneth</creatorcontrib><creatorcontrib>Mills, Daniel</creatorcontrib><creatorcontrib>Sotolongo, Dan</creatorcontrib><creatorcontrib>Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)</creatorcontrib><collection>CrossRef</collection><collection>OSTI.GOV - Hybrid</collection><collection>OSTI.GOV</collection><jtitle>Proceedings of the VLDB Endowment</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Akidau, Tyler</au><au>Begoli, Edmon</au><au>Chernyak, Slava</au><au>Hueske, Fabian</au><au>Knight, Kathryn</au><au>Knowles, Kenneth</au><au>Mills, Daniel</au><au>Sotolongo, Dan</au><aucorp>Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)</aucorp><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Watermarks in stream processing systems: semantics and comparative analysis of Apache Flink and Google cloud dataflow</atitle><jtitle>Proceedings of the VLDB Endowment</jtitle><date>2021-09-01</date><risdate>2021</risdate><volume>14</volume><issue>12</issue><spage>3135</spage><epage>3147</epage><pages>3135-3147</pages><issn>2150-8097</issn><eissn>2150-8097</eissn><abstract>Streaming data processing is an exercise in taming disorder: from oftentimes huge torrents of information, we hope to extract powerful and timely analyses. But when dealing with streaming data, the unbounded and temporally disordered nature of real-world streams introduces a critical challenge: how does one reason about the completeness of a stream that never ends? In this paper, we present a comprehensive definition and analysis of watermarks , a key tool for reasoning about temporal completeness in infinite streams. 
First, we describe what watermarks are and why they are important, highlighting how they address a suite of stream processing needs that are poorly served by eventually-consistent approaches: • Computing a single correct answer, as in notifications. • Reasoning about a lack of data, as in dip detection. • Performing non-incremental processing over temporal subsets of an infinite stream, as in statistical anomaly detection with cubic spline models. • Safely and punctually garbage collecting obsolete inputs and intermediate state. • Surfacing a reliable signal of overall pipeline health . Second, we describe, evaluate, and compare the semantically equivalent, but starkly different, watermark implementations in two modern stream processing engines: Apache Flink and Google Cloud Dataflow.</abstract><cop>United States</cop><doi>10.14778/3476311.3476389</doi><tpages>13</tpages><orcidid>https://orcid.org/0000000221733663</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 2150-8097
ispartof Proceedings of the VLDB Endowment, 2021-09, Vol.14 (12), p.3135-3147
issn 2150-8097
2150-8097
language eng
recordid cdi_osti_scitechconnect_1823361
source ACM Digital Library Complete
title Watermarks in stream processing systems: semantics and comparative analysis of Apache Flink and Google Cloud Dataflow
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-14T01%3A12%3A23IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref_osti_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Watermarks%20in%20stream%20processing%20systems:%20semantics%20and%20comparative%20analysis%20of%20Apache%20Flink%20and%20Google%20cloud%20dataflow&rft.jtitle=Proceedings%20of%20the%20VLDB%20Endowment&rft.au=Akidau,%20Tyler&rft.aucorp=Oak%20Ridge%20National%20Laboratory%20(ORNL),%20Oak%20Ridge,%20TN%20(United%20States)&rft.date=2021-09-01&rft.volume=14&rft.issue=12&rft.spage=3135&rft.epage=3147&rft.pages=3135-3147&rft.issn=2150-8097&rft.eissn=2150-8097&rft_id=info:doi/10.14778/3476311.3476389&rft_dat=%3Ccrossref_osti_%3E10_14778_3476311_3476389%3C/crossref_osti_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true