Reducing the number of response time service level objective violations by a cloud‐HPC convergence scheduler


Detailed Description

Saved in:
Bibliographic Details
Published in: Concurrency and computation 2018-06, Vol.30 (12), p.n/a
Main Authors: Kraemer, Alessandro, Maziero, Carlos, Richard, Olivier, Trystram, Denis
Format: Article
Language: eng
Subjects:
Online Access: Full text
container_end_page n/a
container_issue 12
container_start_page
container_title Concurrency and computation
container_volume 30
creator Kraemer, Alessandro
Maziero, Carlos
Richard, Olivier
Trystram, Denis
description Summary Job scheduling is a long‐standing topic in High‐Performance Computing (HPC), and it is increasingly studied in data centers. Large data centers are often split into separate partitions for cloud computing and HPC; each partition normally has its own scheduler. The possibility of migrating jobs from the HPC partition to the cloud one is widely discussed in the literature. However, job migration from cloud to HPC is a much less explored topic. Nevertheless, such migration may be useful in many situations, in particular when the HPC platform has a low resource usage level and the cloud usage level is high. A large number of jobs that could migrate from the cloud to the HPC partition may be observed in Google data center workloads. Job scheduling with an overbooking strategy is seen as the main reason for the high resource usage level in clouds. However, overbooking can lead to a high rate of rescheduling and job dumping, which potentially causes response time violations. This work shows that HPC platforms can host and execute some cloud jobs with low interference with HPC jobs and a low number of response time violations. We introduce the definition of a cloud‐HPC convergence area and propose a job scheduling strategy for it, aiming at reducing the number of response time violations of cloud jobs without interfering with the execution of HPC jobs. Our proposal is formally defined and then evaluated in different execution scenarios, using the SimGrid simulation framework, with workload data from a production HPC grid. The experimental results show that there is often a large number of empty areas in the scheduling plan of HPC platforms, which makes it possible to allocate cloud jobs by backfilling. This is due to the sparse HPC job submission pattern and the low resource usage level in some HPC platforms.
One simulation scenario considered a set of 11K parallel HPC jobs running on a 2560‐processor platform with an average resource usage level of 38.0%. The proposed convergence scheduler succeeded in injecting around 267K cloud jobs into the HPC platform, with a response time violation rate under 0.00094% for those jobs, considering 80 processors in the convergence area and no effects on the HPC workload. This experiment considered cloud jobs based on job features of Google public cloud workloads, with a processing time slack factor of 1.25 (which is considered high priority in the Google cloud Service Level Agreement, SLA). Usually, most cloud jobs show a slack factor higher than 1.25 (most cloud jobs are medium or low priority). The same simulation, repeated with a higher slack factor (4), showed no response time violations.
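The admission rule the abstract describes, accepting a cloud job into the convergence area only when its response time (waiting plus execution) stays within slack_factor × processing_time, can be sketched as follows. This is an illustrative sketch only, not the authors' SimGrid implementation; the names CloudJob, ConvergenceArea, and try_backfill are hypothetical.

```python
# Illustrative sketch of the convergence-area admission test: a cloud job
# meets its response time SLO when (wait + execution) <= slack_factor *
# processing_time. All class and method names here are hypothetical.

from dataclasses import dataclass, field
from typing import List


@dataclass
class CloudJob:
    processing_time: float      # requested execution time
    slack_factor: float = 1.25  # 1.25 is the "high priority" factor cited above

    @property
    def deadline(self) -> float:
        # Maximum acceptable response time (waiting + execution).
        return self.slack_factor * self.processing_time


@dataclass
class ConvergenceArea:
    processors: int                                     # e.g. 80 in the experiment
    free_at: List[float] = field(default_factory=list)  # next-free time per processor

    def __post_init__(self) -> None:
        if not self.free_at:
            self.free_at = [0.0] * self.processors

    def try_backfill(self, job: CloudJob, now: float) -> bool:
        """Place the job on the earliest-free processor, but only if its
        response time stays within the SLO; otherwise leave it in the cloud."""
        i = min(range(self.processors), key=self.free_at.__getitem__)
        start = max(now, self.free_at[i])
        response_time = (start - now) + job.processing_time
        if response_time > job.deadline:
            return False  # admitting it would cause an SLO violation
        self.free_at[i] = start + job.processing_time
        return True
```

A job with slack factor 1.25 that would have to wait half its processing time is rejected, while the same job with slack factor 4 is accepted, which mirrors the abstract's observation that the higher slack factor eliminated violations.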
doi_str_mv 10.1002/cpe.4352
format Article
publisher Hoboken: Wiley Subscription Services, Inc
publication_date 2018-06-25
rights Copyright © 2018 John Wiley & Sons, Ltd.
Distributed under a Creative Commons Attribution 4.0 International License
orcid https://orcid.org/0000-0003-2795-4675
https://orcid.org/0009-0005-8679-2874
https://orcid.org/0000-0002-2623-6922
fulltext fulltext
identifier ISSN: 1532-0626
ispartof Concurrency and computation, 2018-06, Vol.30 (12), p.n/a
issn 1532-0626
1532-0634
language eng
recordid cdi_hal_primary_oai_HAL_hal_02066143v1
source Wiley Online Library Journals Frontfile Complete
subjects Backfilling
Cloud computing
Computer centers
Computer Science
Computer simulation
Convergence
data center
Data centers
Distributed, Parallel, and Cluster Computing
Dumping
high performance computing
Level (quantity)
Microprocessors
Migration
Partitions
Platforms
Rescheduling
Resource scheduling
Response time
Scheduling
scheduling strategy
Servers
Simulation
Violations
Workload
Workloads
title Reducing the number of response time service level objective violations by a cloud‐HPC convergence scheduler
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-25T02%3A14%3A34IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_hal_p&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Reducing%20the%20number%20of%20response%20time%20service%20level%20objective%20violations%20by%20a%20cloud%E2%80%90HPC%20convergence%20scheduler&rft.jtitle=Concurrency%20and%20computation&rft.au=Kraemer,%20Alessandro&rft.date=2018-06-25&rft.volume=30&rft.issue=12&rft.epage=n/a&rft.issn=1532-0626&rft.eissn=1532-0634&rft_id=info:doi/10.1002/cpe.4352&rft_dat=%3Cproquest_hal_p%3E2047376687%3C/proquest_hal_p%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2047376687&rft_id=info:pmid/&rfr_iscdi=true