Reducing the number of response time service level objective violations by a cloud‐HPC convergence scheduler
Summary: Job scheduling is a long-standing topic in High‐Performance Computing (HPC), and it is increasingly studied in data centers. Large data centers are often split into separate partitions for cloud computing and HPC; each partition normally has its own scheduler. The possibility of migrating jobs...
Published in: | Concurrency and computation 2018-06, Vol.30 (12), p.n/a |
---|---|
Main authors: | Kraemer, Alessandro ; Maziero, Carlos ; Richard, Olivier ; Trystram, Denis |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | n/a |
---|---|
container_issue | 12 |
container_start_page | |
container_title | Concurrency and computation |
container_volume | 30 |
creator | Kraemer, Alessandro ; Maziero, Carlos ; Richard, Olivier ; Trystram, Denis |
description | Summary
Job scheduling is a long-standing topic in High‐Performance Computing (HPC), and it is increasingly studied in data centers. Large data centers are often split into separate partitions for cloud computing and HPC; each partition normally has its own scheduler. The possibility of migrating jobs from the HPC partition to the cloud one is widely discussed in the literature. However, job migration from cloud to HPC is much less explored. Nevertheless, such migration may be useful in many situations, in particular when the HPC platform has a low resource usage level and the cloud usage level is high. A large number of jobs that could migrate from the cloud to the HPC partition can be observed in Google data center workloads. Job scheduling with an overbooking strategy is seen as the main reason for the high resource usage level in clouds. However, overbooking can lead to a high rate of rescheduling and job dumping, which potentially causes response time violations. This work shows that HPC platforms can host and execute some cloud jobs with low interference with HPC jobs and a low number of response time violations. We introduce the definition of a cloud‐HPC convergence area and propose a job scheduling strategy for it, aiming to reduce the number of response time violations of cloud jobs without interfering with HPC job execution. Our proposal is formally defined and then evaluated in different execution scenarios, using the SimGrid simulation framework, with workload data from a production HPC grid. The experimental results show that there is often a large number of empty areas in the scheduling plan of HPC platforms, which makes it possible to allocate cloud jobs by backfilling. This is due to the sparse HPC job submission pattern and the low resource usage level in some HPC platforms. One simulation scenario considered a set of 11K parallel HPC jobs running on a 2560‐processor platform with an average resource usage level of 38.0%.
The proposed convergence scheduler succeeded in injecting around 267K cloud jobs into the HPC platform, with a response time violation rate under 0.00094% for such jobs, using 80 processors in the convergence area and with no effect on the HPC workload. This experiment considered cloud jobs based on job features of Google public cloud workloads, with a processing time slack factor of 1.25 (which is considered high priority in the Google cloud Service Level Agreement, SLA). Usually, most cloud jobs show a slack factor higher than 1.25 (most cloud jobs are medium or low priority). The same simulation, repeated with a higher slack factor (4), showed no response time violations. |
doi_str_mv | 10.1002/cpe.4352 |
format | Article |
fullrecord | <record><control><sourceid>proquest_hal_p</sourceid><recordid>TN_cdi_hal_primary_oai_HAL_hal_02066143v1</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2047376687</sourcerecordid><originalsourceid>FETCH-LOGICAL-c2572-dcc5cd58bc9a4aa631c060d83b4d4d5018567ead258a393f7e420916c16265493</originalsourceid><addsrcrecordid>eNp10M1KAzEQB_BFFKxV8BECXvSwNd-7PZZSrVCwiJ5DNjttU7abmuyu9OYj-Iw-iamV3jxlCL_MTP5Jck3wgGBM780WBpwJepL0iGA0xZLx02NN5XlyEcIaY0IwI72kfoGyNbZeomYFqG43BXjkFshD2Lo6AGrsBlAA31kDqIIOKuSKNZjGdoA66yrd2AhRsUMamcq15ffn13Q-RsbVHfgl1PFdMKs4pgJ_mZwtdBXg6u_sJ28Pk9fxNJ09Pz6NR7PUUJHRtDRGmFLkhRlqrrVkxGCJy5wVvOSlwCQXMgNdUpFrNmSLDDjFQyINiT8UfMj6yd2h70pXauvtRvudctqq6Wim9neYYikJZx2J9uZgt969txAatXatr-N6imKesUzKPIvq9qCMdyF4WBzbEqz2yauYvNonH2l6oB-2gt2_To3nk1__A-rjhL4</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2047376687</pqid></control><display><type>article</type><title>Reducing the number of response time service level objective violations by a cloud‐HPC convergence scheduler</title><source>Wiley Online Library Journals Frontfile Complete</source><creator>Kraemer, Alessandro ; Maziero, Carlos ; Richard, Olivier ; Trystram, Denis</creator><creatorcontrib>Kraemer, Alessandro ; Maziero, Carlos ; Richard, Olivier ; Trystram, Denis</creatorcontrib><description>Summary
Job scheduling is an old topic in High‐Performance Computing (HPC), and it is more and more studied in data centers. Large data centers are often split into separate partitions for cloud computing and HPC; each partition normally has its specific scheduler. The possibility of migrating jobs from the HPC partition to the cloud one is a topic widely discussed in the literature. However, job migration from cloud to HPC is a much less explored topic. Nevertheless, such migration may be useful in many situations, in particular when the HPC platform has a low resource usage level, and the cloud usage level is high. A large number of jobs that could migrate from the cloud to the HPC partition may be observed in Google data center workloads. Job scheduling using overbooking strategy is seen as the main reason for the high resource usage level in clouds. However, overbooking can lead to a high rate of rescheduling and job dumping, which potentially causes response time violations. This work shows that HPC platforms can host and execute some cloud jobs with low interference in HPC jobs and a low number of response time violations. We introduce the definition of a cloud‐HPC convergence area and propose a job scheduling strategy for it, aiming at reducing the number of response time violations of cloud jobs without interfering with HPC jobs execution. Our proposal is formally defined and then evaluated in different execution scenarios, using the SimGrid simulation framework, with workload data from production HPC grid. The experimental results show that often, there is a large number of empty areas in the scheduling plan of HPC platforms, which makes it possible to allocate cloud jobs by backfilling. This is due to the sparse HPC job submission pattern and the low resource usage level in some HPC platforms. One performed simulation scenario considered a set of 11K parallel HPC jobs running on a 2560‐processor platform having an average resource usage level of 38.0%. 
The proposed convergence scheduler succeeded to inject around 267K cloud jobs in the HPC platform, with a response time violation rate under 0.00094% for such jobs, considering 80 processors in the convergence area and no effects on the HPC workload. This experiment considered cloud jobs based on job features of Google public cloud workloads, with a processing time slack factor of 1.25 (which is considered as high priority in the Google cloud SLA—Service Level Agreement). Usually, most cloud jobs show a slack factor higher than 1.25 (most cloud jobs are medium or low priority). The same simulation, repeated with a higher slack factor (4), showed no response time violations.</description><identifier>ISSN: 1532-0626</identifier><identifier>EISSN: 1532-0634</identifier><identifier>DOI: 10.1002/cpe.4352</identifier><language>eng</language><publisher>Hoboken: Wiley Subscription Services, Inc</publisher><subject>Backfilling ; Cloud computing ; Computer centers ; Computer Science ; Computer simulation ; Convergence ; data center ; Data centers ; Distributed, Parallel, and Cluster Computing ; Dumping ; high performance computing ; Level (quantity) ; Microprocessors ; Migration ; Partitions ; Platforms ; Rescheduling ; Resource scheduling ; Response time ; Scheduling ; scheduling strategy ; Servers ; Simulation ; Violations ; Workload ; Workloads</subject><ispartof>Concurrency and computation, 2018-06, Vol.30 (12), p.n/a</ispartof><rights>Copyright © 2017 John Wiley & Sons, Ltd.</rights><rights>Copyright © 2018 John Wiley & Sons, Ltd.</rights><rights>Distributed under a Creative Commons Attribution 4.0 International License</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c2572-dcc5cd58bc9a4aa631c060d83b4d4d5018567ead258a393f7e420916c16265493</citedby><cites>FETCH-LOGICAL-c2572-dcc5cd58bc9a4aa631c060d83b4d4d5018567ead258a393f7e420916c16265493</cites><orcidid>0000-0003-2795-4675 ; 0009-0005-8679-2874 
; 0000-0002-2623-6922</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://onlinelibrary.wiley.com/doi/pdf/10.1002%2Fcpe.4352$$EPDF$$P50$$Gwiley$$H</linktopdf><linktohtml>$$Uhttps://onlinelibrary.wiley.com/doi/full/10.1002%2Fcpe.4352$$EHTML$$P50$$Gwiley$$H</linktohtml><link.rule.ids>230,314,776,780,881,1411,27903,27904,45553,45554</link.rule.ids><backlink>$$Uhttps://hal.science/hal-02066143$$DView record in HAL$$Hfree_for_read</backlink></links><search><creatorcontrib>Kraemer, Alessandro</creatorcontrib><creatorcontrib>Maziero, Carlos</creatorcontrib><creatorcontrib>Richard, Olivier</creatorcontrib><creatorcontrib>Trystram, Denis</creatorcontrib><title>Reducing the number of response time service level objective violations by a cloud‐HPC convergence scheduler</title><title>Concurrency and computation</title><description>Summary
Job scheduling is an old topic in High‐Performance Computing (HPC), and it is more and more studied in data centers. Large data centers are often split into separate partitions for cloud computing and HPC; each partition normally has its specific scheduler. The possibility of migrating jobs from the HPC partition to the cloud one is a topic widely discussed in the literature. However, job migration from cloud to HPC is a much less explored topic. Nevertheless, such migration may be useful in many situations, in particular when the HPC platform has a low resource usage level, and the cloud usage level is high. A large number of jobs that could migrate from the cloud to the HPC partition may be observed in Google data center workloads. Job scheduling using overbooking strategy is seen as the main reason for the high resource usage level in clouds. However, overbooking can lead to a high rate of rescheduling and job dumping, which potentially causes response time violations. This work shows that HPC platforms can host and execute some cloud jobs with low interference in HPC jobs and a low number of response time violations. We introduce the definition of a cloud‐HPC convergence area and propose a job scheduling strategy for it, aiming at reducing the number of response time violations of cloud jobs without interfering with HPC jobs execution. Our proposal is formally defined and then evaluated in different execution scenarios, using the SimGrid simulation framework, with workload data from production HPC grid. The experimental results show that often, there is a large number of empty areas in the scheduling plan of HPC platforms, which makes it possible to allocate cloud jobs by backfilling. This is due to the sparse HPC job submission pattern and the low resource usage level in some HPC platforms. One performed simulation scenario considered a set of 11K parallel HPC jobs running on a 2560‐processor platform having an average resource usage level of 38.0%. 
The proposed convergence scheduler succeeded to inject around 267K cloud jobs in the HPC platform, with a response time violation rate under 0.00094% for such jobs, considering 80 processors in the convergence area and no effects on the HPC workload. This experiment considered cloud jobs based on job features of Google public cloud workloads, with a processing time slack factor of 1.25 (which is considered as high priority in the Google cloud SLA—Service Level Agreement). Usually, most cloud jobs show a slack factor higher than 1.25 (most cloud jobs are medium or low priority). The same simulation, repeated with a higher slack factor (4), showed no response time violations.</description><subject>Backfilling</subject><subject>Cloud computing</subject><subject>Computer centers</subject><subject>Computer Science</subject><subject>Computer simulation</subject><subject>Convergence</subject><subject>data center</subject><subject>Data centers</subject><subject>Distributed, Parallel, and Cluster Computing</subject><subject>Dumping</subject><subject>high performance computing</subject><subject>Level (quantity)</subject><subject>Microprocessors</subject><subject>Migration</subject><subject>Partitions</subject><subject>Platforms</subject><subject>Rescheduling</subject><subject>Resource scheduling</subject><subject>Response time</subject><subject>Scheduling</subject><subject>scheduling 
strategy</subject><subject>Servers</subject><subject>Simulation</subject><subject>Violations</subject><subject>Workload</subject><subject>Workloads</subject><issn>1532-0626</issn><issn>1532-0634</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2018</creationdate><recordtype>article</recordtype><recordid>eNp10M1KAzEQB_BFFKxV8BECXvSwNd-7PZZSrVCwiJ5DNjttU7abmuyu9OYj-Iw-iamV3jxlCL_MTP5Jck3wgGBM780WBpwJepL0iGA0xZLx02NN5XlyEcIaY0IwI72kfoGyNbZeomYFqG43BXjkFshD2Lo6AGrsBlAA31kDqIIOKuSKNZjGdoA66yrd2AhRsUMamcq15ffn13Q-RsbVHfgl1PFdMKs4pgJ_mZwtdBXg6u_sJ28Pk9fxNJ09Pz6NR7PUUJHRtDRGmFLkhRlqrrVkxGCJy5wVvOSlwCQXMgNdUpFrNmSLDDjFQyINiT8UfMj6yd2h70pXauvtRvudctqq6Wim9neYYikJZx2J9uZgt969txAatXatr-N6imKesUzKPIvq9qCMdyF4WBzbEqz2yauYvNonH2l6oB-2gt2_To3nk1__A-rjhL4</recordid><startdate>20180625</startdate><enddate>20180625</enddate><creator>Kraemer, Alessandro</creator><creator>Maziero, Carlos</creator><creator>Richard, Olivier</creator><creator>Trystram, Denis</creator><general>Wiley Subscription Services, Inc</general><general>Wiley</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>1XC</scope><orcidid>https://orcid.org/0000-0003-2795-4675</orcidid><orcidid>https://orcid.org/0009-0005-8679-2874</orcidid><orcidid>https://orcid.org/0000-0002-2623-6922</orcidid></search><sort><creationdate>20180625</creationdate><title>Reducing the number of response time service level objective violations by a cloud‐HPC convergence scheduler</title><author>Kraemer, Alessandro ; Maziero, Carlos ; Richard, Olivier ; Trystram, Denis</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c2572-dcc5cd58bc9a4aa631c060d83b4d4d5018567ead258a393f7e420916c16265493</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2018</creationdate><topic>Backfilling</topic><topic>Cloud 
computing</topic><topic>Computer centers</topic><topic>Computer Science</topic><topic>Computer simulation</topic><topic>Convergence</topic><topic>data center</topic><topic>Data centers</topic><topic>Distributed, Parallel, and Cluster Computing</topic><topic>Dumping</topic><topic>high performance computing</topic><topic>Level (quantity)</topic><topic>Microprocessors</topic><topic>Migration</topic><topic>Partitions</topic><topic>Platforms</topic><topic>Rescheduling</topic><topic>Resource scheduling</topic><topic>Response time</topic><topic>Scheduling</topic><topic>scheduling strategy</topic><topic>Servers</topic><topic>Simulation</topic><topic>Violations</topic><topic>Workload</topic><topic>Workloads</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Kraemer, Alessandro</creatorcontrib><creatorcontrib>Maziero, Carlos</creatorcontrib><creatorcontrib>Richard, Olivier</creatorcontrib><creatorcontrib>Trystram, Denis</creatorcontrib><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Hyper Article en Ligne (HAL)</collection><jtitle>Concurrency and computation</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Kraemer, Alessandro</au><au>Maziero, Carlos</au><au>Richard, Olivier</au><au>Trystram, Denis</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Reducing the number of response time service level objective violations by a cloud‐HPC convergence scheduler</atitle><jtitle>Concurrency and 
computation</jtitle><date>2018-06-25</date><risdate>2018</risdate><volume>30</volume><issue>12</issue><epage>n/a</epage><issn>1532-0626</issn><eissn>1532-0634</eissn><abstract>Summary
Job scheduling is an old topic in High‐Performance Computing (HPC), and it is more and more studied in data centers. Large data centers are often split into separate partitions for cloud computing and HPC; each partition normally has its specific scheduler. The possibility of migrating jobs from the HPC partition to the cloud one is a topic widely discussed in the literature. However, job migration from cloud to HPC is a much less explored topic. Nevertheless, such migration may be useful in many situations, in particular when the HPC platform has a low resource usage level, and the cloud usage level is high. A large number of jobs that could migrate from the cloud to the HPC partition may be observed in Google data center workloads. Job scheduling using overbooking strategy is seen as the main reason for the high resource usage level in clouds. However, overbooking can lead to a high rate of rescheduling and job dumping, which potentially causes response time violations. This work shows that HPC platforms can host and execute some cloud jobs with low interference in HPC jobs and a low number of response time violations. We introduce the definition of a cloud‐HPC convergence area and propose a job scheduling strategy for it, aiming at reducing the number of response time violations of cloud jobs without interfering with HPC jobs execution. Our proposal is formally defined and then evaluated in different execution scenarios, using the SimGrid simulation framework, with workload data from production HPC grid. The experimental results show that often, there is a large number of empty areas in the scheduling plan of HPC platforms, which makes it possible to allocate cloud jobs by backfilling. This is due to the sparse HPC job submission pattern and the low resource usage level in some HPC platforms. One performed simulation scenario considered a set of 11K parallel HPC jobs running on a 2560‐processor platform having an average resource usage level of 38.0%. 
The proposed convergence scheduler succeeded to inject around 267K cloud jobs in the HPC platform, with a response time violation rate under 0.00094% for such jobs, considering 80 processors in the convergence area and no effects on the HPC workload. This experiment considered cloud jobs based on job features of Google public cloud workloads, with a processing time slack factor of 1.25 (which is considered as high priority in the Google cloud SLA—Service Level Agreement). Usually, most cloud jobs show a slack factor higher than 1.25 (most cloud jobs are medium or low priority). The same simulation, repeated with a higher slack factor (4), showed no response time violations.</abstract><cop>Hoboken</cop><pub>Wiley Subscription Services, Inc</pub><doi>10.1002/cpe.4352</doi><tpages>1</tpages><orcidid>https://orcid.org/0000-0003-2795-4675</orcidid><orcidid>https://orcid.org/0009-0005-8679-2874</orcidid><orcidid>https://orcid.org/0000-0002-2623-6922</orcidid></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1532-0626 |
ispartof | Concurrency and computation, 2018-06, Vol.30 (12), p.n/a |
issn | 1532-0626 1532-0634 |
language | eng |
recordid | cdi_hal_primary_oai_HAL_hal_02066143v1 |
source | Wiley Online Library Journals Frontfile Complete |
subjects | Backfilling ; Cloud computing ; Computer centers ; Computer Science ; Computer simulation ; Convergence ; data center ; Data centers ; Distributed, Parallel, and Cluster Computing ; Dumping ; high performance computing ; Level (quantity) ; Microprocessors ; Migration ; Partitions ; Platforms ; Rescheduling ; Resource scheduling ; Response time ; Scheduling ; scheduling strategy ; Servers ; Simulation ; Violations ; Workload ; Workloads |
title | Reducing the number of response time service level objective violations by a cloud‐HPC convergence scheduler |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-25T02%3A14%3A34IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_hal_p&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Reducing%20the%20number%20of%20response%20time%20service%20level%20objective%20violations%20by%20a%20cloud%E2%80%90HPC%20convergence%20scheduler&rft.jtitle=Concurrency%20and%20computation&rft.au=Kraemer,%20Alessandro&rft.date=2018-06-25&rft.volume=30&rft.issue=12&rft.epage=n/a&rft.issn=1532-0626&rft.eissn=1532-0634&rft_id=info:doi/10.1002/cpe.4352&rft_dat=%3Cproquest_hal_p%3E2047376687%3C/proquest_hal_p%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2047376687&rft_id=info:pmid/&rfr_iscdi=true |
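
The abstract describes allocating cloud jobs by backfilling into empty areas of the HPC scheduling plan and counting response time violations against a slack factor. The following is a minimal sketch of that idea, not the authors' scheduler: the names `CloudJob` and `backfill`, the single-processor job model, the greedy earliest-start rule, and the reading of the slack factor as "response time must not exceed slack × processing time" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloudJob:
    submit: float        # submission time
    runtime: float       # processing time
    slack: float = 1.25  # slack factor (1.25 is the value quoted in the abstract)

    @property
    def deadline(self) -> float:
        # Assumption: the objective is met when response time <= slack * runtime.
        return self.submit + self.slack * self.runtime


def backfill(jobs, gaps):
    """Greedily place one-processor jobs into idle intervals (hypothetical model).

    gaps is a list of (start, end) idle intervals left by HPC jobs.
    Returns (placements, violations); placements maps job index to start time.
    """
    free = sorted(gaps)
    placements, violations = {}, 0
    for idx in sorted(range(len(jobs)), key=lambda i: jobs[i].submit):
        job = jobs[idx]
        best = None  # (earliest feasible start, position of the interval in free)
        for pos, (g_start, g_end) in enumerate(free):
            start = max(g_start, job.submit)
            if start + job.runtime <= g_end and (best is None or start < best[0]):
                best = (start, pos)
        if best is None:
            continue  # no idle interval can host this job; it stays unplaced
        start, pos = best
        g_start, g_end = free.pop(pos)
        finish = start + job.runtime
        # Keep the unused pieces of the interval on both sides of the job.
        free += [(a, b) for a, b in ((g_start, start), (finish, g_end)) if a < b]
        free.sort()
        placements[idx] = start
        if finish > job.deadline:
            violations += 1
    return placements, violations


# Toy data, invented for this sketch (not taken from the paper's workloads).
jobs = [CloudJob(0, 10), CloudJob(0, 10), CloudJob(10, 2)]
gaps = [(0, 12), (20, 40)]
print(backfill(jobs, gaps))
```

In this toy input the second job has to wait for the later idle interval, so it misses its deadline at a slack factor of 1.25; rebuilding the same jobs with a slack factor of 4 removes the violation. This only illustrates the direction of the effect reported in the abstract and says nothing about its magnitude.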