Reducing the number of response time service level objective violations by a cloud‐HPC convergence scheduler


Detailed Description

Saved in:
Bibliographic Details
Published in: Concurrency and computation 2018-06, Vol.30 (12), p.n/a
Main Authors: Kraemer, Alessandro, Maziero, Carlos, Richard, Olivier, Trystram, Denis
Format: Article
Language: eng
Subjects:
Online Access: Full text
container_end_page n/a
container_issue 12
container_start_page
container_title Concurrency and computation
container_volume 30
creator Kraemer, Alessandro
Maziero, Carlos
Richard, Olivier
Trystram, Denis
description Summary Job scheduling is a long‐standing topic in High‐Performance Computing (HPC), and it is increasingly studied in data centers. Large data centers are often split into separate partitions for cloud computing and HPC; each partition normally has its own scheduler. The possibility of migrating jobs from the HPC partition to the cloud one is widely discussed in the literature. However, job migration from cloud to HPC is a much less explored topic. Nevertheless, such migration may be useful in many situations, in particular when the HPC platform has a low resource usage level and the cloud usage level is high. A large number of jobs that could migrate from the cloud to the HPC partition may be observed in Google data center workloads. Job scheduling with an overbooking strategy is seen as the main reason for the high resource usage level in clouds. However, overbooking can lead to a high rate of rescheduling and job dumping, which potentially causes response time violations. This work shows that HPC platforms can host and execute some cloud jobs with low interference with HPC jobs and a low number of response time violations. We introduce the definition of a cloud‐HPC convergence area and propose a job scheduling strategy for it, aiming at reducing the number of response time violations of cloud jobs without interfering with the execution of HPC jobs. Our proposal is formally defined and then evaluated in different execution scenarios, using the SimGrid simulation framework, with workload data from a production HPC grid. The experimental results show that there is often a large number of empty areas in the scheduling plan of HPC platforms, which makes it possible to allocate cloud jobs by backfilling. This is due to the sparse HPC job submission pattern and the low resource usage level in some HPC platforms.
One simulation scenario considered a set of 11K parallel HPC jobs running on a 2560‐processor platform with an average resource usage level of 38.0%. The proposed convergence scheduler succeeded in injecting around 267K cloud jobs into the HPC platform, with a response time violation rate under 0.00094% for those jobs, considering 80 processors in the convergence area and no effects on the HPC workload. This experiment considered cloud jobs based on job features of Google public cloud workloads, with a processing time slack factor of 1.25 (which is considered high priority in the Google cloud Service Level Agreement, SLA). Usually, most cloud jobs show a slack factor higher than 1.25 (most cloud jobs are medium or low priority). The same simulation, repeated with a higher slack factor (4), showed no response time violations.
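The admission rule the abstract describes, accepting a cloud job into the convergence area only when its response time (waiting plus execution) stays within slack_factor × processing_time, can be sketched as follows. This is an illustrative sketch only, not the authors' SimGrid implementation; the names CloudJob, ConvergenceArea, and try_backfill are hypothetical.

```python
# Illustrative sketch of the convergence-area admission test: a cloud job
# meets its response time SLO when (wait + execution) <= slack_factor *
# processing_time. All class and method names here are hypothetical.

from dataclasses import dataclass, field
from typing import List


@dataclass
class CloudJob:
    processing_time: float      # requested execution time
    slack_factor: float = 1.25  # 1.25 is the "high priority" factor cited above

    @property
    def deadline(self) -> float:
        # Maximum acceptable response time (waiting + execution).
        return self.slack_factor * self.processing_time


@dataclass
class ConvergenceArea:
    processors: int                                     # e.g. 80 in the experiment
    free_at: List[float] = field(default_factory=list)  # next-free time per processor

    def __post_init__(self) -> None:
        if not self.free_at:
            self.free_at = [0.0] * self.processors

    def try_backfill(self, job: CloudJob, now: float) -> bool:
        """Place the job on the earliest-free processor, but only if its
        response time stays within the SLO; otherwise leave it in the cloud."""
        i = min(range(self.processors), key=self.free_at.__getitem__)
        start = max(now, self.free_at[i])
        response_time = (start - now) + job.processing_time
        if response_time > job.deadline:
            return False  # admitting it would cause an SLO violation
        self.free_at[i] = start + job.processing_time
        return True
```

A job with slack factor 1.25 that would have to wait half its processing time is rejected, while the same job with slack factor 4 is accepted, which mirrors the abstract's observation that the higher slack factor eliminated violations.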
doi_str_mv 10.1002/cpe.4352
format Article
publisher Hoboken: Wiley Subscription Services, Inc
publication_date 2018-06-25
rights Copyright © 2018 John Wiley & Sons, Ltd.
Distributed under a Creative Commons Attribution 4.0 International License
orcid https://orcid.org/0000-0003-2795-4675
https://orcid.org/0009-0005-8679-2874
https://orcid.org/0000-0002-2623-6922
fulltext fulltext
identifier ISSN: 1532-0626
ispartof Concurrency and computation, 2018-06, Vol.30 (12), p.n/a
issn 1532-0626
1532-0634
language eng
recordid cdi_hal_primary_oai_HAL_hal_02066143v1
source Wiley Online Library Journals Frontfile Complete
subjects Backfilling
Cloud computing
Computer centers
Computer Science
Computer simulation
Convergence
data center
Data centers
Distributed, Parallel, and Cluster Computing
Dumping
high performance computing
Level (quantity)
Microprocessors
Migration
Partitions
Platforms
Rescheduling
Resource scheduling
Response time
Scheduling
scheduling strategy
Servers
Simulation
Violations
Workload
Workloads
title Reducing the number of response time service level objective violations by a cloud‐HPC convergence scheduler
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-25T02%3A14%3A34IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_hal_p&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Reducing%20the%20number%20of%20response%20time%20service%20level%20objective%20violations%20by%20a%20cloud%E2%80%90HPC%20convergence%20scheduler&rft.jtitle=Concurrency%20and%20computation&rft.au=Kraemer,%20Alessandro&rft.date=2018-06-25&rft.volume=30&rft.issue=12&rft.epage=n/a&rft.issn=1532-0626&rft.eissn=1532-0634&rft_id=info:doi/10.1002/cpe.4352&rft_dat=%3Cproquest_hal_p%3E2047376687%3C/proquest_hal_p%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2047376687&rft_id=info:pmid/&rfr_iscdi=true