Component-distinguishable Co-location and Resource Reclamation for High-throughput Computing

Cloud service providers improve resource utilization by co-locating latency-critical (LC) workloads with best-effort batch (BE) jobs in datacenters. However, they usually treat multi-component LCs as monolithic applications and treat BEs as “second-class citizens” when allocating resources to them....

Bibliographic Details
Published in: ACM transactions on computer systems 2024-05, Vol.42 (1-2), p.1-37, Article 2
Main authors: Zhao, Laiping; Cui, Yushuai; Yang, Yanan; Zhou, Xiaobo; Qiu, Tie; Li, Keqiu; Bao, Yungang
Format: Article
Language: English
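Subjects: Cloud computing; Computer systems organization; Interference; Reclamation; Resource utilization; Rhythm; Workload; Workloads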
Online access: Full text
container_end_page 37
container_issue 1-2
container_start_page 1
container_title ACM transactions on computer systems
container_volume 42
creator Zhao, Laiping
Cui, Yushuai
Yang, Yanan
Zhou, Xiaobo
Qiu, Tie
Li, Keqiu
Bao, Yungang
description Cloud service providers improve resource utilization by co-locating latency-critical (LC) workloads with best-effort batch (BE) jobs in datacenters. However, they usually treat multi-component LCs as monolithic applications and treat BEs as “second-class citizens” when allocating resources to them. Neglecting the inconsistent interference tolerance abilities of LC components and the inconsistent preemption loss of BE workloads can result in missed co-location opportunities for higher throughput. We present Rhythm, a co-location controller that deploys workloads and reclaims resources rhythmically for maximizing the system throughput while guaranteeing LC service’s tail latency requirement. The key idea is to differentiate the BE throughput launched with each LC component, that is, components with higher interference tolerance can be deployed together with more BE jobs. It also assigns different reclamation priority values to BEs by evaluating their preemption losses into a multi-level reclamation queue. We implement and evaluate Rhythm using workloads in the form of containerized processes and microservices. Experimental results show that it can improve the system throughput by 47.3%, CPU utilization by 38.6%, and memory bandwidth utilization by 45.4% while guaranteeing the tail latency requirement.
doi_str_mv 10.1145/3630006
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_3058280272</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3058280272</sourcerecordid><originalsourceid>FETCH-LOGICAL-a267t-42836cda6f1a2b2b72586d34594c0aee09cda7d67eeffffe5344108a607a7af73</originalsourceid><addsrcrecordid>eNo9kM1LxDAQxYMouK7i3VPBg6dovtM9SlFXWBBEb0KZTdNtl7apSXrwvzdLV-cyw7wfb4aH0DUl95QK-cAVJ4SoE7SgUmqsOeenaEE0F5gRTc_RRQj7RKQ9W6CvwvWjG-wQcdWG2A67qQ0NbDubFQ53zkBs3ZDBUGXvNrjJG5sG00E_C7Xz2brdNTg23k27ZpxidrCcDlaX6KyGLtirY1-iz-enj2KNN28vr8XjBgNTOmLBcq5MBaqmwLZsq5nMVcWFXAlDwFqySqKulLa2TmUlF4KSHBTRoKHWfIluZ9_Ru-_Jhlju06dDOllyInOWE6ZZou5myngXgrd1Ofq2B_9TUlIeoiuP0SXyZibB9P_Qn_gLGcFplA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3058280272</pqid></control><display><type>article</type><title>Component-distinguishable Co-location and Resource Reclamation for High-throughput Computing</title><source>ACM Digital Library</source><source>Business Source Complete</source><creator>Zhao, Laiping ; Cui, Yushuai ; Yang, Yanan ; Zhou, Xiaobo ; Qiu, Tie ; Li, Keqiu ; Bao, Yungang</creator><creatorcontrib>Zhao, Laiping ; Cui, Yushuai ; Yang, Yanan ; Zhou, Xiaobo ; Qiu, Tie ; Li, Keqiu ; Bao, Yungang</creatorcontrib><description>Cloud service providers improve resource utilization by co-locating latency-critical (LC) workloads with best-effort batch (BE) jobs in datacenters. However, they usually treat multi-component LCs as monolithic applications and treat BEs as “second-class citizens” when allocating resources to them. 
Neglecting the inconsistent interference tolerance abilities of LC components and the inconsistent preemption loss of BE workloads can result in missed co-location opportunities for higher throughput.We present Rhythm, a co-location controller that deploys workloads and reclaims resources rhythmically for maximizing the system throughput while guaranteeing LC service’s tail latency requirement. The key idea is to differentiate the BE throughput launched with each LC component, that is, components with higher interference tolerance can be deployed together with more BE jobs. It also assigns different reclamation priority values to BEs by evaluating their preemption losses into a multi-level reclamation queue. We implement and evaluate Rhythm using workloads in the form of containerized processes and microservices. Experimental results show that it can improve the system throughput by 47.3%, CPU utilization by 38.6%, and memory bandwidth utilization by 45.4% while guaranteeing the tail latency requirement.</description><identifier>ISSN: 0734-2071</identifier><identifier>EISSN: 1557-7333</identifier><identifier>DOI: 10.1145/3630006</identifier><language>eng</language><publisher>New York, NY: ACM</publisher><subject>Cloud computing ; Computer systems organization ; Interference ; Reclamation ; Resource utilization ; Rhythm ; Workload ; Workloads</subject><ispartof>ACM transactions on computer systems, 2024-05, Vol.42 (1-2), p.1-37, Article 2</ispartof><rights>Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. 
To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from</rights><rights>Copyright Association for Computing Machinery May 2024</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-a267t-42836cda6f1a2b2b72586d34594c0aee09cda7d67eeffffe5344108a607a7af73</cites><orcidid>0000-0003-1967-2192 ; 0000-0002-7772-458X ; 0000-0003-2324-2523 ; 0009-0009-9413-5187 ; 0000-0002-2222-393X ; 0000-0003-1758-3030 ; 0000-0001-6565-5276</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://dl.acm.org/doi/pdf/10.1145/3630006$$EPDF$$P50$$Gacm$$H</linktopdf><link.rule.ids>314,780,784,2282,27924,27925,40196,76228</link.rule.ids></links><search><creatorcontrib>Zhao, Laiping</creatorcontrib><creatorcontrib>Cui, Yushuai</creatorcontrib><creatorcontrib>Yang, Yanan</creatorcontrib><creatorcontrib>Zhou, Xiaobo</creatorcontrib><creatorcontrib>Qiu, Tie</creatorcontrib><creatorcontrib>Li, Keqiu</creatorcontrib><creatorcontrib>Bao, Yungang</creatorcontrib><title>Component-distinguishable Co-location and Resource Reclamation for High-throughput Computing</title><title>ACM transactions on computer systems</title><addtitle>ACM TOCS</addtitle><description>Cloud service providers improve resource utilization by co-locating latency-critical (LC) workloads with best-effort batch (BE) jobs in datacenters. However, they usually treat multi-component LCs as monolithic applications and treat BEs as “second-class citizens” when allocating resources to them. 
Neglecting the inconsistent interference tolerance abilities of LC components and the inconsistent preemption loss of BE workloads can result in missed co-location opportunities for higher throughput.We present Rhythm, a co-location controller that deploys workloads and reclaims resources rhythmically for maximizing the system throughput while guaranteeing LC service’s tail latency requirement. The key idea is to differentiate the BE throughput launched with each LC component, that is, components with higher interference tolerance can be deployed together with more BE jobs. It also assigns different reclamation priority values to BEs by evaluating their preemption losses into a multi-level reclamation queue. We implement and evaluate Rhythm using workloads in the form of containerized processes and microservices. Experimental results show that it can improve the system throughput by 47.3%, CPU utilization by 38.6%, and memory bandwidth utilization by 45.4% while guaranteeing the tail latency requirement.</description><subject>Cloud computing</subject><subject>Computer systems organization</subject><subject>Interference</subject><subject>Reclamation</subject><subject>Resource utilization</subject><subject>Rhythm</subject><subject>Workload</subject><subject>Workloads</subject><issn>0734-2071</issn><issn>1557-7333</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><recordid>eNo9kM1LxDAQxYMouK7i3VPBg6dovtM9SlFXWBBEb0KZTdNtl7apSXrwvzdLV-cyw7wfb4aH0DUl95QK-cAVJ4SoE7SgUmqsOeenaEE0F5gRTc_RRQj7RKQ9W6CvwvWjG-wQcdWG2A67qQ0NbDubFQ53zkBs3ZDBUGXvNrjJG5sG00E_C7Xz2brdNTg23k27ZpxidrCcDlaX6KyGLtirY1-iz-enj2KNN28vr8XjBgNTOmLBcq5MBaqmwLZsq5nMVcWFXAlDwFqySqKulLa2TmUlF4KSHBTRoKHWfIluZ9_Ru-_Jhlju06dDOllyInOWE6ZZou5myngXgrd1Ofq2B_9TUlIeoiuP0SXyZibB9P_Qn_gLGcFplA</recordid><startdate>20240501</startdate><enddate>20240501</enddate><creator>Zhao, Laiping</creator><creator>Cui, Yushuai</creator><creator>Yang, 
Yanan</creator><creator>Zhou, Xiaobo</creator><creator>Qiu, Tie</creator><creator>Li, Keqiu</creator><creator>Bao, Yungang</creator><general>ACM</general><general>Association for Computing Machinery</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><orcidid>https://orcid.org/0000-0003-1967-2192</orcidid><orcidid>https://orcid.org/0000-0002-7772-458X</orcidid><orcidid>https://orcid.org/0000-0003-2324-2523</orcidid><orcidid>https://orcid.org/0009-0009-9413-5187</orcidid><orcidid>https://orcid.org/0000-0002-2222-393X</orcidid><orcidid>https://orcid.org/0000-0003-1758-3030</orcidid><orcidid>https://orcid.org/0000-0001-6565-5276</orcidid></search><sort><creationdate>20240501</creationdate><title>Component-distinguishable Co-location and Resource Reclamation for High-throughput Computing</title><author>Zhao, Laiping ; Cui, Yushuai ; Yang, Yanan ; Zhou, Xiaobo ; Qiu, Tie ; Li, Keqiu ; Bao, Yungang</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a267t-42836cda6f1a2b2b72586d34594c0aee09cda7d67eeffffe5344108a607a7af73</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Cloud computing</topic><topic>Computer systems organization</topic><topic>Interference</topic><topic>Reclamation</topic><topic>Resource utilization</topic><topic>Rhythm</topic><topic>Workload</topic><topic>Workloads</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Zhao, Laiping</creatorcontrib><creatorcontrib>Cui, Yushuai</creatorcontrib><creatorcontrib>Yang, Yanan</creatorcontrib><creatorcontrib>Zhou, Xiaobo</creatorcontrib><creatorcontrib>Qiu, Tie</creatorcontrib><creatorcontrib>Li, Keqiu</creatorcontrib><creatorcontrib>Bao, Yungang</creatorcontrib><collection>CrossRef</collection><collection>Computer and Information Systems 
Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>ACM transactions on computer systems</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Zhao, Laiping</au><au>Cui, Yushuai</au><au>Yang, Yanan</au><au>Zhou, Xiaobo</au><au>Qiu, Tie</au><au>Li, Keqiu</au><au>Bao, Yungang</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Component-distinguishable Co-location and Resource Reclamation for High-throughput Computing</atitle><jtitle>ACM transactions on computer systems</jtitle><stitle>ACM TOCS</stitle><date>2024-05-01</date><risdate>2024</risdate><volume>42</volume><issue>1-2</issue><spage>1</spage><epage>37</epage><pages>1-37</pages><artnum>2</artnum><issn>0734-2071</issn><eissn>1557-7333</eissn><abstract>Cloud service providers improve resource utilization by co-locating latency-critical (LC) workloads with best-effort batch (BE) jobs in datacenters. However, they usually treat multi-component LCs as monolithic applications and treat BEs as “second-class citizens” when allocating resources to them. Neglecting the inconsistent interference tolerance abilities of LC components and the inconsistent preemption loss of BE workloads can result in missed co-location opportunities for higher throughput.We present Rhythm, a co-location controller that deploys workloads and reclaims resources rhythmically for maximizing the system throughput while guaranteeing LC service’s tail latency requirement. 
The key idea is to differentiate the BE throughput launched with each LC component, that is, components with higher interference tolerance can be deployed together with more BE jobs. It also assigns different reclamation priority values to BEs by evaluating their preemption losses into a multi-level reclamation queue. We implement and evaluate Rhythm using workloads in the form of containerized processes and microservices. Experimental results show that it can improve the system throughput by 47.3%, CPU utilization by 38.6%, and memory bandwidth utilization by 45.4% while guaranteeing the tail latency requirement.</abstract><cop>New York, NY</cop><pub>ACM</pub><doi>10.1145/3630006</doi><tpages>37</tpages><orcidid>https://orcid.org/0000-0003-1967-2192</orcidid><orcidid>https://orcid.org/0000-0002-7772-458X</orcidid><orcidid>https://orcid.org/0000-0003-2324-2523</orcidid><orcidid>https://orcid.org/0009-0009-9413-5187</orcidid><orcidid>https://orcid.org/0000-0002-2222-393X</orcidid><orcidid>https://orcid.org/0000-0003-1758-3030</orcidid><orcidid>https://orcid.org/0000-0001-6565-5276</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 0734-2071
ispartof ACM transactions on computer systems, 2024-05, Vol.42 (1-2), p.1-37, Article 2
issn 0734-2071
1557-7333
language eng
recordid cdi_proquest_journals_3058280272
source ACM Digital Library; Business Source Complete
subjects Cloud computing
Computer systems organization
Interference
Reclamation
Resource utilization
Rhythm
Workload
Workloads
title Component-distinguishable Co-location and Resource Reclamation for High-throughput Computing
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-24T06%3A54%3A06IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Component-distinguishable%20Co-location%20and%20Resource%20Reclamation%20for%20High-throughput%20Computing&rft.jtitle=ACM%20transactions%20on%20computer%20systems&rft.au=Zhao,%20Laiping&rft.date=2024-05-01&rft.volume=42&rft.issue=1-2&rft.spage=1&rft.epage=37&rft.pages=1-37&rft.artnum=2&rft.issn=0734-2071&rft.eissn=1557-7333&rft_id=info:doi/10.1145/3630006&rft_dat=%3Cproquest_cross%3E3058280272%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3058280272&rft_id=info:pmid/&rfr_iscdi=true