StreamOps: Cloud-Native Runtime Management for Streaming Services in ByteDance

Stream processing is widely used for real-time data processing and decision-making, leading to tens of thousands of streaming jobs deployed in ByteDance cloud. Since those streaming jobs usually run for several days or longer and the input workloads vary over time, they usually face diverse runtime...

Bibliographic details
Published in: Proceedings of the VLDB Endowment 2023-08, Vol.16 (12), p.3501-3514
Main authors: Mao, Yancan, Chen, Zhanghao, Zhang, Yifan, Wang, Meng, Fang, Yong, Zhang, Guanghui, Shi, Rui, Ma, Richard T. B.
Format: Article
Language: eng
Online access: Full text
container_end_page 3514
container_issue 12
container_start_page 3501
container_title Proceedings of the VLDB Endowment
container_volume 16
creator Mao, Yancan
Chen, Zhanghao
Zhang, Yifan
Wang, Meng
Fang, Yong
Zhang, Guanghui
Shi, Rui
Ma, Richard T. B.
description Stream processing is widely used for real-time data processing and decision-making, leading to tens of thousands of streaming jobs deployed in the ByteDance cloud. Since those streaming jobs usually run for several days or longer and the input workloads vary over time, they usually face diverse runtime issues such as processing lag and varying failures. This requires runtime management to resolve such runtime issues automatically. However, designing a runtime management service at ByteDance scale is challenging. In particular, the service has to concurrently manage cluster-wide streaming jobs in a scalable and extensible manner. Furthermore, it should also be able to manage diverse streaming jobs effectively. To this end, we propose StreamOps to enable cloud-native runtime management for streaming jobs in ByteDance. StreamOps has three main designs to address the challenges. 1) To allow for scalability, StreamOps runs as a standalone lightweight control plane to manage cluster-wide streaming jobs. 2) To enable extensible runtime management, StreamOps abstracts control policies to identify and resolve runtime issues. New control policies can be implemented with a detect-diagnose-resolve programming paradigm. Each control policy is also configurable for different streaming jobs according to the performance requirements. 3) To mitigate processing lag and handle failures effectively, StreamOps features three control policies, i.e., auto-scaler, straggler detector, and job doctor, that are inspired by state-of-the-art research and production experiences at ByteDance. In this paper, we introduce the design decisions we made and the lessons we learned from building StreamOps. We evaluate StreamOps in our production environment, and the experiment results have further validated our system design.
doi_str_mv 10.14778/3611540.3611543
format Article
fulltext fulltext
identifier ISSN: 2150-8097
ispartof Proceedings of the VLDB Endowment, 2023-08, Vol.16 (12), p.3501-3514
issn 2150-8097
2150-8097
language eng
recordid cdi_crossref_primary_10_14778_3611540_3611543
source Access via ACM Digital Library
title StreamOps: Cloud-Native Runtime Management for Streaming Services in ByteDance
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-27T15%3A46%3A42IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=StreamOps:%20Cloud-Native%20Runtime%20Management%20for%20Streaming%20Services%20in%20ByteDance&rft.jtitle=Proceedings%20of%20the%20VLDB%20Endowment&rft.au=Mao,%20Yancan&rft.date=2023-08-01&rft.volume=16&rft.issue=12&rft.spage=3501&rft.epage=3514&rft.pages=3501-3514&rft.issn=2150-8097&rft.eissn=2150-8097&rft_id=info:doi/10.14778/3611540.3611543&rft_dat=%3Ccrossref%3E10_14778_3611540_3611543%3C/crossref%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true
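
The description above mentions that new control policies follow a detect-diagnose-resolve programming paradigm and are configurable per job. The following is only a minimal sketch of what such a policy interface might look like. It is not taken from the paper: the class names (`ControlPolicy`, `LagAutoScaler`, `JobMetrics`), the metric fields, the thresholds, and the doubling rule are all hypothetical and chosen purely for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobMetrics:
    """Hypothetical metrics snapshot for one streaming job."""
    job_id: str
    lag_seconds: float
    parallelism: int


class ControlPolicy(ABC):
    """Hypothetical base class: one detect -> diagnose -> resolve pass."""

    @abstractmethod
    def detect(self, m: JobMetrics) -> bool:
        """Return True if the job shows a runtime issue."""

    @abstractmethod
    def diagnose(self, m: JobMetrics) -> Optional[str]:
        """Return a cause label, or None if no actionable cause is found."""

    @abstractmethod
    def resolve(self, m: JobMetrics, cause: str) -> str:
        """Return a description of the action to take."""

    def run_once(self, m: JobMetrics) -> Optional[str]:
        # Stop early when nothing is detected or no cause can be named.
        if not self.detect(m):
            return None
        cause = self.diagnose(m)
        if cause is None:
            return None
        return self.resolve(m, cause)


class LagAutoScaler(ControlPolicy):
    """Illustrative policy: double parallelism when lag exceeds a threshold."""

    def __init__(self, max_lag_seconds: float = 60.0, max_parallelism: int = 64):
        # Per-job configuration (assumed knobs, not from the paper).
        self.max_lag_seconds = max_lag_seconds
        self.max_parallelism = max_parallelism

    def detect(self, m: JobMetrics) -> bool:
        return m.lag_seconds > self.max_lag_seconds

    def diagnose(self, m: JobMetrics) -> Optional[str]:
        if m.parallelism >= self.max_parallelism:
            return None  # already at the cap; scaling out cannot help
        return "insufficient_parallelism"

    def resolve(self, m: JobMetrics, cause: str) -> str:
        target = min(m.parallelism * 2, self.max_parallelism)
        return f"rescale {m.job_id} to parallelism {target}"


if __name__ == "__main__":
    policy = LagAutoScaler(max_lag_seconds=30.0)
    print(policy.run_once(JobMetrics("job-a", lag_seconds=120.0, parallelism=4)))
```

In this sketch, per-job configurability is modeled simply as constructor arguments, so two jobs with different latency requirements would get two differently parameterized policy instances. How StreamOps itself expresses configuration is not stated in this record.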