Component-distinguishable Co-location and Resource Reclamation for High-throughput Computing

Cloud service providers improve resource utilization by co-locating latency-critical (LC) workloads with best-effort batch (BE) jobs in datacenters. However, they usually treat multi-component LCs as monolithic applications and treat BEs as “second-class citizens” when allocating resources to them....

Detailed description

Bibliographic details
Published in:	ACM transactions on computer systems 2024-05, Vol.42 (1-2), p.1-37, Article 2
Main authors:	Zhao, Laiping; Cui, Yushuai; Yang, Yanan; Zhou, Xiaobo; Qiu, Tie; Li, Keqiu; Bao, Yungang
Format:	Article
Language:	eng
Subjects:	Cloud computing; Computer systems organization; Interference; Reclamation; Resource utilization; Rhythm; Workload; Workloads
Online access:	Full text

container_end_page	37
container_issue	1-2
container_start_page	1
container_title	ACM transactions on computer systems
container_volume	42
creator	Zhao, Laiping; Cui, Yushuai; Yang, Yanan; Zhou, Xiaobo; Qiu, Tie; Li, Keqiu; Bao, Yungang
description	Cloud service providers improve resource utilization by co-locating latency-critical (LC) workloads with best-effort batch (BE) jobs in datacenters. However, they usually treat multi-component LCs as monolithic applications and treat BEs as “second-class citizens” when allocating resources to them. Neglecting the inconsistent interference tolerance abilities of LC components and the inconsistent preemption loss of BE workloads can result in missed co-location opportunities for higher throughput. We present Rhythm, a co-location controller that deploys workloads and reclaims resources rhythmically for maximizing the system throughput while guaranteeing LC service’s tail latency requirement. The key idea is to differentiate the BE throughput launched with each LC component, that is, components with higher interference tolerance can be deployed together with more BE jobs. It also assigns different reclamation priority values to BEs by evaluating their preemption losses into a multi-level reclamation queue. We implement and evaluate Rhythm using workloads in the form of containerized processes and microservices. Experimental results show that it can improve the system throughput by 47.3%, CPU utilization by 38.6%, and memory bandwidth utilization by 45.4% while guaranteeing the tail latency requirement.
doi_str_mv	10.1145/3630006
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_3058280272</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3058280272</sourcerecordid><originalsourceid>FETCH-LOGICAL-a267t-42836cda6f1a2b2b72586d34594c0aee09cda7d67eeffffe5344108a607a7af73</originalsourceid><addsrcrecordid>eNo9kM1LxDAQxYMouK7i3VPBg6dovtM9SlFXWBBEb0KZTdNtl7apSXrwvzdLV-cyw7wfb4aH0DUl95QK-cAVJ4SoE7SgUmqsOeenaEE0F5gRTc_RRQj7RKQ9W6CvwvWjG-wQcdWG2A67qQ0NbDubFQ53zkBs3ZDBUGXvNrjJG5sG00E_C7Xz2brdNTg23k27ZpxidrCcDlaX6KyGLtirY1-iz-enj2KNN28vr8XjBgNTOmLBcq5MBaqmwLZsq5nMVcWFXAlDwFqySqKulLa2TmUlF4KSHBTRoKHWfIluZ9_Ru-_Jhlju06dDOllyInOWE6ZZou5myngXgrd1Ofq2B_9TUlIeoiuP0SXyZibB9P_Qn_gLGcFplA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3058280272</pqid></control><display><type>article</type><title>Component-distinguishable Co-location and Resource Reclamation for High-throughput Computing</title><source>ACM Digital Library</source><source>Business Source Complete</source><creator>Zhao, Laiping ; Cui, Yushuai ; Yang, Yanan ; Zhou, Xiaobo ; Qiu, Tie ; Li, Keqiu ; Bao, Yungang</creator><creatorcontrib>Zhao, Laiping ; Cui, Yushuai ; Yang, Yanan ; Zhou, Xiaobo ; Qiu, Tie ; Li, Keqiu ; Bao, Yungang</creatorcontrib><description>Cloud service providers improve resource utilization by co-locating latency-critical (LC) workloads with best-effort batch (BE) jobs in datacenters. However, they usually treat multi-component LCs as monolithic applications and treat BEs as “second-class citizens” when allocating resources to them. 
Neglecting the inconsistent interference tolerance abilities of LC components and the inconsistent preemption loss of BE workloads can result in missed co-location opportunities for higher throughput.We present Rhythm, a co-location controller that deploys workloads and reclaims resources rhythmically for maximizing the system throughput while guaranteeing LC service’s tail latency requirement. The key idea is to differentiate the BE throughput launched with each LC component, that is, components with higher interference tolerance can be deployed together with more BE jobs. It also assigns different reclamation priority values to BEs by evaluating their preemption losses into a multi-level reclamation queue. We implement and evaluate Rhythm using workloads in the form of containerized processes and microservices. Experimental results show that it can improve the system throughput by 47.3%, CPU utilization by 38.6%, and memory bandwidth utilization by 45.4% while guaranteeing the tail latency requirement.</description><identifier>ISSN: 0734-2071</identifier><identifier>EISSN: 1557-7333</identifier><identifier>DOI: 10.1145/3630006</identifier><language>eng</language><publisher>New York, NY: ACM</publisher><subject>Cloud computing ; Computer systems organization ; Interference ; Reclamation ; Resource utilization ; Rhythm ; Workload ; Workloads</subject><ispartof>ACM transactions on computer systems, 2024-05, Vol.42 (1-2), p.1-37, Article 2</ispartof><rights>Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. 
To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from</rights><rights>Copyright Association for Computing Machinery May 2024</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-a267t-42836cda6f1a2b2b72586d34594c0aee09cda7d67eeffffe5344108a607a7af73</cites><orcidid>0000-0003-1967-2192 ; 0000-0002-7772-458X ; 0000-0003-2324-2523 ; 0009-0009-9413-5187 ; 0000-0002-2222-393X ; 0000-0003-1758-3030 ; 0000-0001-6565-5276</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://dl.acm.org/doi/pdf/10.1145/3630006$$EPDF$$P50$$Gacm$$H</linktopdf><link.rule.ids>314,780,784,2282,27924,27925,40196,76228</link.rule.ids></links><search><creatorcontrib>Zhao, Laiping</creatorcontrib><creatorcontrib>Cui, Yushuai</creatorcontrib><creatorcontrib>Yang, Yanan</creatorcontrib><creatorcontrib>Zhou, Xiaobo</creatorcontrib><creatorcontrib>Qiu, Tie</creatorcontrib><creatorcontrib>Li, Keqiu</creatorcontrib><creatorcontrib>Bao, Yungang</creatorcontrib><title>Component-distinguishable Co-location and Resource Reclamation for High-throughput Computing</title><title>ACM transactions on computer systems</title><addtitle>ACM TOCS</addtitle><description>Cloud service providers improve resource utilization by co-locating latency-critical (LC) workloads with best-effort batch (BE) jobs in datacenters. However, they usually treat multi-component LCs as monolithic applications and treat BEs as “second-class citizens” when allocating resources to them. 
Neglecting the inconsistent interference tolerance abilities of LC components and the inconsistent preemption loss of BE workloads can result in missed co-location opportunities for higher throughput.We present Rhythm, a co-location controller that deploys workloads and reclaims resources rhythmically for maximizing the system throughput while guaranteeing LC service’s tail latency requirement. The key idea is to differentiate the BE throughput launched with each LC component, that is, components with higher interference tolerance can be deployed together with more BE jobs. It also assigns different reclamation priority values to BEs by evaluating their preemption losses into a multi-level reclamation queue. We implement and evaluate Rhythm using workloads in the form of containerized processes and microservices. Experimental results show that it can improve the system throughput by 47.3%, CPU utilization by 38.6%, and memory bandwidth utilization by 45.4% while guaranteeing the tail latency requirement.</description><subject>Cloud computing</subject><subject>Computer systems organization</subject><subject>Interference</subject><subject>Reclamation</subject><subject>Resource utilization</subject><subject>Rhythm</subject><subject>Workload</subject><subject>Workloads</subject><issn>0734-2071</issn><issn>1557-7333</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><recordid>eNo9kM1LxDAQxYMouK7i3VPBg6dovtM9SlFXWBBEb0KZTdNtl7apSXrwvzdLV-cyw7wfb4aH0DUl95QK-cAVJ4SoE7SgUmqsOeenaEE0F5gRTc_RRQj7RKQ9W6CvwvWjG-wQcdWG2A67qQ0NbDubFQ53zkBs3ZDBUGXvNrjJG5sG00E_C7Xz2brdNTg23k27ZpxidrCcDlaX6KyGLtirY1-iz-enj2KNN28vr8XjBgNTOmLBcq5MBaqmwLZsq5nMVcWFXAlDwFqySqKulLa2TmUlF4KSHBTRoKHWfIluZ9_Ru-_Jhlju06dDOllyInOWE6ZZou5myngXgrd1Ofq2B_9TUlIeoiuP0SXyZibB9P_Qn_gLGcFplA</recordid><startdate>20240501</startdate><enddate>20240501</enddate><creator>Zhao, Laiping</creator><creator>Cui, Yushuai</creator><creator>Yang, 
Yanan</creator><creator>Zhou, Xiaobo</creator><creator>Qiu, Tie</creator><creator>Li, Keqiu</creator><creator>Bao, Yungang</creator><general>ACM</general><general>Association for Computing Machinery</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><orcidid>https://orcid.org/0000-0003-1967-2192</orcidid><orcidid>https://orcid.org/0000-0002-7772-458X</orcidid><orcidid>https://orcid.org/0000-0003-2324-2523</orcidid><orcidid>https://orcid.org/0009-0009-9413-5187</orcidid><orcidid>https://orcid.org/0000-0002-2222-393X</orcidid><orcidid>https://orcid.org/0000-0003-1758-3030</orcidid><orcidid>https://orcid.org/0000-0001-6565-5276</orcidid></search><sort><creationdate>20240501</creationdate><title>Component-distinguishable Co-location and Resource Reclamation for High-throughput Computing</title><author>Zhao, Laiping ; Cui, Yushuai ; Yang, Yanan ; Zhou, Xiaobo ; Qiu, Tie ; Li, Keqiu ; Bao, Yungang</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a267t-42836cda6f1a2b2b72586d34594c0aee09cda7d67eeffffe5344108a607a7af73</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Cloud computing</topic><topic>Computer systems organization</topic><topic>Interference</topic><topic>Reclamation</topic><topic>Resource utilization</topic><topic>Rhythm</topic><topic>Workload</topic><topic>Workloads</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Zhao, Laiping</creatorcontrib><creatorcontrib>Cui, Yushuai</creatorcontrib><creatorcontrib>Yang, Yanan</creatorcontrib><creatorcontrib>Zhou, Xiaobo</creatorcontrib><creatorcontrib>Qiu, Tie</creatorcontrib><creatorcontrib>Li, Keqiu</creatorcontrib><creatorcontrib>Bao, Yungang</creatorcontrib><collection>CrossRef</collection><collection>Computer and Information Systems 
Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>ACM transactions on computer systems</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Zhao, Laiping</au><au>Cui, Yushuai</au><au>Yang, Yanan</au><au>Zhou, Xiaobo</au><au>Qiu, Tie</au><au>Li, Keqiu</au><au>Bao, Yungang</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Component-distinguishable Co-location and Resource Reclamation for High-throughput Computing</atitle><jtitle>ACM transactions on computer systems</jtitle><stitle>ACM TOCS</stitle><date>2024-05-01</date><risdate>2024</risdate><volume>42</volume><issue>1-2</issue><spage>1</spage><epage>37</epage><pages>1-37</pages><artnum>2</artnum><issn>0734-2071</issn><eissn>1557-7333</eissn><abstract>Cloud service providers improve resource utilization by co-locating latency-critical (LC) workloads with best-effort batch (BE) jobs in datacenters. However, they usually treat multi-component LCs as monolithic applications and treat BEs as “second-class citizens” when allocating resources to them. Neglecting the inconsistent interference tolerance abilities of LC components and the inconsistent preemption loss of BE workloads can result in missed co-location opportunities for higher throughput.We present Rhythm, a co-location controller that deploys workloads and reclaims resources rhythmically for maximizing the system throughput while guaranteeing LC service’s tail latency requirement. 
The key idea is to differentiate the BE throughput launched with each LC component, that is, components with higher interference tolerance can be deployed together with more BE jobs. It also assigns different reclamation priority values to BEs by evaluating their preemption losses into a multi-level reclamation queue. We implement and evaluate Rhythm using workloads in the form of containerized processes and microservices. Experimental results show that it can improve the system throughput by 47.3%, CPU utilization by 38.6%, and memory bandwidth utilization by 45.4% while guaranteeing the tail latency requirement.</abstract><cop>New York, NY</cop><pub>ACM</pub><doi>10.1145/3630006</doi><tpages>37</tpages><orcidid>https://orcid.org/0000-0003-1967-2192</orcidid><orcidid>https://orcid.org/0000-0002-7772-458X</orcidid><orcidid>https://orcid.org/0000-0003-2324-2523</orcidid><orcidid>https://orcid.org/0009-0009-9413-5187</orcidid><orcidid>https://orcid.org/0000-0002-2222-393X</orcidid><orcidid>https://orcid.org/0000-0003-1758-3030</orcidid><orcidid>https://orcid.org/0000-0001-6565-5276</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 0734-2071
ispartof	ACM transactions on computer systems, 2024-05, Vol.42 (1-2), p.1-37, Article 2
issn	0734-2071; 1557-7333
language	eng
recordid	cdi_proquest_journals_3058280272
source	ACM Digital Library; Business Source Complete
subjects	Cloud computing; Computer systems organization; Interference; Reclamation; Resource utilization; Rhythm; Workload; Workloads
title	Component-distinguishable Co-location and Resource Reclamation for High-throughput Computing
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-24T06%3A54%3A06IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Component-distinguishable%20Co-location%20and%20Resource%20Reclamation%20for%20High-throughput%20Computing&rft.jtitle=ACM%20transactions%20on%20computer%20systems&rft.au=Zhao,%20Laiping&rft.date=2024-05-01&rft.volume=42&rft.issue=1-2&rft.spage=1&rft.epage=37&rft.pages=1-37&rft.artnum=2&rft.issn=0734-2071&rft.eissn=1557-7333&rft_id=info:doi/10.1145/3630006&rft_dat=%3Cproquest_cross%3E3058280272%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3058280272&rft_id=info:pmid/&rfr_iscdi=true