Scheduling a 100,000 Core Supercomputer for Maximum Utilization and Capability

In late 2009, the National Institute for Computational Sciences placed in production the world's fastest academic supercomputer (third overall), a Cray XT5 named Kraken, with almost 100,000 compute cores and a peak speed in excess of one Petaflop. Delivering over 50% of the total cycles available to the National Science Foundation users via the TeraGrid, Kraken has two missions that have historically proven difficult to simultaneously reconcile: providing the maximum number of total cycles to the community, while enabling full machine runs for "hero" users.

Detailed Description

Saved in:
Bibliographic Details
Main Authors: Andrews, Phil, Kovatch, Patricia, Hazlewood, Victor, Baer, Troy
Format: Conference Proceeding
Language: eng
Subjects:
Online Access: Order full text
container_end_page 427
container_issue
container_start_page 421
container_title
container_volume
creator Andrews, Phil
Kovatch, Patricia
Hazlewood, Victor
Baer, Troy
description In late 2009, the National Institute for Computational Sciences placed in production the world's fastest academic supercomputer (third overall), a Cray XT5 named Kraken, with almost 100,000 compute cores and a peak speed in excess of one Petaflop. Delivering over 50% of the total cycles available to the National Science Foundation users via the TeraGrid, Kraken has two missions that have historically proven difficult to simultaneously reconcile: providing the maximum number of total cycles to the community, while enabling full machine runs for "hero" users. Historically, this has been attempted by allowing schedulers to choose the correct time for the beginning of large jobs, with a concomitant reduction in utilization. At NICS, we used the results of a previous theoretical investigation to adopt a different approach, where the "clearing out" of the system is forced on a weekly basis, followed by consecutive full machine runs. As our previous simulation results suggested, this led to a significant improvement in utilization, to over 90%. The difference in utilization between the traditional and adopted scheduling policies was the equivalent of a 300+ Teraflop supercomputer, or several million dollars of compute time per year.
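The trade-off the abstract describes (idle drain time while a scheduler waits to assemble a full-machine job, versus a scheduled weekly drain with back-to-back "hero" runs) can be illustrated with a toy utilization model. This is a minimal sketch under invented assumptions (core count aside, every number below — the small-job load, drain lengths, and hero-run duration — is hypothetical), not the simulation or the figures from the paper:

```python
# Toy discrete-time model of one week on a 100,000-core machine,
# contrasting two ways to fit a full-machine ("hero") run into a
# stream of small jobs. All parameters are illustrative assumptions.

CORES = 100_000
WEEK_HOURS = 168
SMALL_LOAD = 0.95 * CORES   # assumed steady core demand from small jobs
HERO_HOURS = 12             # assumed length of the full-machine run
MAX_SMALL_WALL = 24         # assumed maximum wallclock of a small job

def utilization(drain_hours):
    """Fraction of the week's core-hours delivered when `drain_hours`
    are spent ramping down (draining) before the full-machine run."""
    small_hours = WEEK_HOURS - drain_hours - HERO_HOURS
    used = small_hours * SMALL_LOAD        # normal small-job operation
    used += drain_hours * SMALL_LOAD / 2   # linear ramp-down while draining
    used += HERO_HOURS * CORES             # the full-machine run itself
    return used / (WEEK_HOURS * CORES)

# Traditional policy: the scheduler picks its own start time and may
# have to drain for up to a full max-wallclock window.
trad = utilization(drain_hours=MAX_SMALL_WALL)

# Forced weekly drain: the clear-out time is known in advance, so jobs
# are sized to finish by the deadline and the ramp-down is short.
weekly = utilization(drain_hours=4)

print(f"traditional ~{trad:.0%}, forced weekly drain ~{weekly:.0%}")
```

Even this crude model shows the mechanism: shortening the drain window converts otherwise-idle ramp-down core-hours back into delivered cycles, which at Petaflop scale is worth hundreds of Teraflops of effective capacity.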
doi_str_mv 10.1109/ICPPW.2010.63
format Conference Proceeding
fullrecord ISBN: 1424479185; 9781424479184; EISBN: 0769541577; 9780769541570; DOI: 10.1109/ICPPW.2010.63; publisher: IEEE; conference: 2010 39th International Conference on Parallel Processing Workshops; date: 2010-09; pages: 421-427 (7 pages)
fulltext fulltext_linktorsrc
identifier ISSN: 0190-3918
ispartof 2010 39th International Conference on Parallel Processing Workshops, 2010, p.421-427
issn 0190-3918
2332-5690
language eng
recordid cdi_ieee_primary_5599101
source IEEE Electronic Library (IEL) Conference Proceedings
subjects Aggregates
High-performance computing
Processor scheduling
Production
Resource management
Runtime
scheduling
Supercomputers
systems software
title Scheduling a 100,000 Core Supercomputer for Maximum Utilization and Capability
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-23T06%3A59%3A16IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Scheduling%20a%20100,000%20Core%20Supercomputer%20for%20Maximum%20Utilization%20and%20Capability&rft.btitle=2010%2039th%20International%20Conference%20on%20Parallel%20Processing%20Workshops&rft.au=Andrews,%20Phil&rft.date=2010-09&rft.spage=421&rft.epage=427&rft.pages=421-427&rft.issn=0190-3918&rft.eissn=2332-5690&rft.isbn=1424479185&rft.isbn_list=9781424479184&rft_id=info:doi/10.1109/ICPPW.2010.63&rft_dat=%3Cieee_6IE%3E5599101%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&rft.eisbn=0769541577&rft.eisbn_list=9780769541570&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=5599101&rfr_iscdi=true