Automated Performance Modeling of HPC Applications Using Machine Learning

Automated performance modeling and performance prediction of parallel programs are highly valuable in many use cases, such as in guiding task management and job scheduling, offering insights of application behaviors, and assisting resource requirement estimation. The performance of parallel programs...

Bibliographic details
Published in:	IEEE transactions on computers 2020-05, Vol.69 (5), p.749-763
Main authors:	Sun, Jingwei; Sun, Guangzhong; Zhan, Shiyan; Zhang, Jiepeng; Chen, Yong
Format:	Article
Language:	eng
Subjects:	Algorithms; Automation; Computational modeling; Data collection; Data models; Domains; Empirical analysis; Feature extraction; Instruments; Machine learning; Model accuracy; model transferring; Modelling; Parallel computing; Parallel programming; performance modeling; Performance prediction; Predictive models; Resource management; Run time (computers); Runtime; Task scheduling

container_end_page	763
container_issue	5
container_start_page	749
container_title	IEEE transactions on computers
container_volume	69
creator	Sun, Jingwei; Sun, Guangzhong; Zhan, Shiyan; Zhang, Jiepeng; Chen, Yong
description	Automated performance modeling and performance prediction of parallel programs are highly valuable in many use cases, such as in guiding task management and job scheduling, offering insights of application behaviors, and assisting resource requirement estimation. The performance of parallel programs is affected by numerous factors, including but not limited to hardware, applications, algorithms, and input parameters, thus an accurate performance prediction is often a challenging and daunting task. In this article, we focus on automatically predicting the execution time of parallel programs (more specifically, MPI programs) with different inputs, at different scales, and without domain knowledge. We model the correlation between the execution time and domain-independent runtime features. These features include values of variables, counters of branches, loops, and MPI communications. Through automatically instrumenting an MPI program, each execution of the program will output a feature vector and its corresponding execution time. After collecting data from executions with different inputs, a random forest machine learning approach is used to build an empirical performance model, which can predict the execution time of the program given a new input. A transfer learning method is used to reuse an existing performance model and improve the prediction accuracy on a new platform that lacks historical execution data. Our experiments and analyses of three parallel applications, Graph500, GalaxSee, and SMG2000, on three different systems confirm that our method performs well, with less than 20 percent prediction error on average.
doi_str_mv	10.1109/TC.2020.2964767
format	Article
fullrecord	<record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_crossref_primary_10_1109_TC_2020_2964767</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>8956059</ieee_id><sourcerecordid>2388101282</sourcerecordid><originalsourceid>FETCH-LOGICAL-c355t-8b1438ed37bd4d9801076683ffb1044ba2b8faf5d66e9ef0357f1b13f2ce9df23</originalsourceid><addsrcrecordid>eNo9kDFPwzAQRi0EEqUwM7BEYk57tmPHHqsIaKVWdGhny0nOkKqNg50O_HsStWI66bv33UmPkGcKM0pBz3fFjAGDGdMyy2V-QyZUiDzVWshbMgGgKtU8g3vyEOMBACQDPSGrxbn3J9tjnWwxOB9Otq0w2fgaj037lXiXLLdFsui6Y1PZvvFtTPZx3Gxs9d20mKzRhnYIHsmds8eIT9c5Jfv3t12xTNefH6tisU4rLkSfqpJmXGHN87LOaq2AQi6l4s6VFLKstKxUzjpRS4kaHXCRO1pS7liFunaMT8nr5W4X_M8ZY28O_hza4aVhXCkKlKmRml-oKvgYAzrTheZkw6-hYEZfZleY0Ze5-hoaL5dGg4j_tBr0gdD8D9GWZZc</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2388101282</pqid></control><display><type>article</type><title>Automated Performance Modeling of HPC Applications Using Machine Learning</title><source>IEEE Electronic Library (IEL)</source><creator>Sun, Jingwei ; Sun, Guangzhong ; Zhan, Shiyan ; Zhang, Jiepeng ; Chen, Yong</creator><creatorcontrib>Sun, Jingwei ; Sun, Guangzhong ; Zhan, Shiyan ; Zhang, Jiepeng ; Chen, Yong</creatorcontrib><description>Automated performance modeling and performance prediction of parallel programs are highly valuable in many use cases, such as in guiding task management and job scheduling, offering insights of application behaviors, and assisting resource requirement estimation. The performance of parallel programs is affected by numerous factors, including but not limited to hardware, applications, algorithms, and input parameters, thus an accurate performance prediction is often a challenging and daunting task. 
In this article, we focus on automatically predicting the execution time of parallel programs (more specifically, MPI programs) with different inputs, at different scales, and without domain knowledge. We model the correlation between the execution time and domain-independent runtime features. These features include values of variables, counters of branches, loops, and MPI communications. Through automatically instrumenting an MPI program, each execution of the program will output a feature vector and its corresponding execution time. After collecting data from executions with different inputs, a random forest machine learning approach is used to build an empirical performance model, which can predict the execution time of the program given a new input. A transfer learning method is used to reuse an existing performance model and improve the prediction accuracy on a new platform that lacks historical execution data. Our experiments and analyses of three parallel applications, Graph500, GalaxSee, and SMG2000, on three different systems confirm that our method performs well, with less than 20 percent prediction error on average.</description><identifier>ISSN: 0018-9340</identifier><identifier>EISSN: 1557-9956</identifier><identifier>DOI: 10.1109/TC.2020.2964767</identifier><identifier>CODEN: ITCOB4</identifier><language>eng</language><publisher>New York: IEEE</publisher><subject>Algorithms ; Automation ; Computational modeling ; Data collection ; Data models ; Domains ; Empirical analysis ; Feature extraction ; Instruments ; Machine learning ; Model accuracy ; model transferring ; Modelling ; Parallel computing ; Parallel programming ; performance modeling ; Performance prediction ; Predictive models ; Resource management ; Run time (computers) ; Runtime ; Task scheduling</subject><ispartof>IEEE transactions on computers, 2020-05, Vol.69 (5), p.749-763</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) 2020</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c355t-8b1438ed37bd4d9801076683ffb1044ba2b8faf5d66e9ef0357f1b13f2ce9df23</citedby><cites>FETCH-LOGICAL-c355t-8b1438ed37bd4d9801076683ffb1044ba2b8faf5d66e9ef0357f1b13f2ce9df23</cites><orcidid>0000-0001-5098-1503 ; 0000-0002-9961-9051</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/8956059$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,780,784,796,27924,27925,54758</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/8956059$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Sun, Jingwei</creatorcontrib><creatorcontrib>Sun, Guangzhong</creatorcontrib><creatorcontrib>Zhan, Shiyan</creatorcontrib><creatorcontrib>Zhang, Jiepeng</creatorcontrib><creatorcontrib>Chen, Yong</creatorcontrib><title>Automated Performance Modeling of HPC Applications Using Machine Learning</title><title>IEEE transactions on computers</title><addtitle>TC</addtitle><description>Automated performance modeling and performance prediction of parallel programs are highly valuable in many use cases, such as in guiding task management and job scheduling, offering insights of application behaviors, and assisting resource requirement estimation. The performance of parallel programs is affected by numerous factors, including but not limited to hardware, applications, algorithms, and input parameters, thus an accurate performance prediction is often a challenging and daunting task. In this article, we focus on automatically predicting the execution time of parallel programs (more specifically, MPI programs) with different inputs, at different scales, and without domain knowledge. 
We model the correlation between the execution time and domain-independent runtime features. These features include values of variables, counters of branches, loops, and MPI communications. Through automatically instrumenting an MPI program, each execution of the program will output a feature vector and its corresponding execution time. After collecting data from executions with different inputs, a random forest machine learning approach is used to build an empirical performance model, which can predict the execution time of the program given a new input. A transfer learning method is used to reuse an existing performance model and improve the prediction accuracy on a new platform that lacks historical execution data. Our experiments and analyses of three parallel applications, Graph500, GalaxSee, and SMG2000, on three different systems confirm that our method performs well, with less than 20 percent prediction error on average.</description><subject>Algorithms</subject><subject>Automation</subject><subject>Computational modeling</subject><subject>Data collection</subject><subject>Data models</subject><subject>Domains</subject><subject>Empirical analysis</subject><subject>Feature extraction</subject><subject>Instruments</subject><subject>Machine learning</subject><subject>Model accuracy</subject><subject>model transferring</subject><subject>Modelling</subject><subject>Parallel computing</subject><subject>Parallel programming</subject><subject>performance modeling</subject><subject>Performance prediction</subject><subject>Predictive models</subject><subject>Resource management</subject><subject>Run time (computers)</subject><subject>Runtime</subject><subject>Task 
scheduling</subject><issn>0018-9340</issn><issn>1557-9956</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNo9kDFPwzAQRi0EEqUwM7BEYk57tmPHHqsIaKVWdGhny0nOkKqNg50O_HsStWI66bv33UmPkGcKM0pBz3fFjAGDGdMyy2V-QyZUiDzVWshbMgGgKtU8g3vyEOMBACQDPSGrxbn3J9tjnWwxOB9Otq0w2fgaj037lXiXLLdFsui6Y1PZvvFtTPZx3Gxs9d20mKzRhnYIHsmds8eIT9c5Jfv3t12xTNefH6tisU4rLkSfqpJmXGHN87LOaq2AQi6l4s6VFLKstKxUzjpRS4kaHXCRO1pS7liFunaMT8nr5W4X_M8ZY28O_hza4aVhXCkKlKmRml-oKvgYAzrTheZkw6-hYEZfZleY0Ze5-hoaL5dGg4j_tBr0gdD8D9GWZZc</recordid><startdate>20200501</startdate><enddate>20200501</enddate><creator>Sun, Jingwei</creator><creator>Sun, Guangzhong</creator><creator>Zhan, Shiyan</creator><creator>Zhang, Jiepeng</creator><creator>Chen, Yong</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. (IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><orcidid>https://orcid.org/0000-0001-5098-1503</orcidid><orcidid>https://orcid.org/0000-0002-9961-9051</orcidid></search><sort><creationdate>20200501</creationdate><title>Automated Performance Modeling of HPC Applications Using Machine Learning</title><author>Sun, Jingwei ; Sun, Guangzhong ; Zhan, Shiyan ; Zhang, Jiepeng ; Chen, Yong</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c355t-8b1438ed37bd4d9801076683ffb1044ba2b8faf5d66e9ef0357f1b13f2ce9df23</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Algorithms</topic><topic>Automation</topic><topic>Computational modeling</topic><topic>Data collection</topic><topic>Data models</topic><topic>Domains</topic><topic>Empirical analysis</topic><topic>Feature 
extraction</topic><topic>Instruments</topic><topic>Machine learning</topic><topic>Model accuracy</topic><topic>model transferring</topic><topic>Modelling</topic><topic>Parallel computing</topic><topic>Parallel programming</topic><topic>performance modeling</topic><topic>Performance prediction</topic><topic>Predictive models</topic><topic>Resource management</topic><topic>Run time (computers)</topic><topic>Runtime</topic><topic>Task scheduling</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Sun, Jingwei</creatorcontrib><creatorcontrib>Sun, Guangzhong</creatorcontrib><creatorcontrib>Zhan, Shiyan</creatorcontrib><creatorcontrib>Zhang, Jiepeng</creatorcontrib><creatorcontrib>Chen, Yong</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics & Communications Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>IEEE transactions on computers</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Sun, Jingwei</au><au>Sun, Guangzhong</au><au>Zhan, Shiyan</au><au>Zhang, Jiepeng</au><au>Chen, Yong</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Automated Performance Modeling of HPC Applications Using Machine Learning</atitle><jtitle>IEEE transactions on 
computers</jtitle><stitle>TC</stitle><date>2020-05-01</date><risdate>2020</risdate><volume>69</volume><issue>5</issue><spage>749</spage><epage>763</epage><pages>749-763</pages><issn>0018-9340</issn><eissn>1557-9956</eissn><coden>ITCOB4</coden><abstract>Automated performance modeling and performance prediction of parallel programs are highly valuable in many use cases, such as in guiding task management and job scheduling, offering insights of application behaviors, and assisting resource requirement estimation. The performance of parallel programs is affected by numerous factors, including but not limited to hardware, applications, algorithms, and input parameters, thus an accurate performance prediction is often a challenging and daunting task. In this article, we focus on automatically predicting the execution time of parallel programs (more specifically, MPI programs) with different inputs, at different scales, and without domain knowledge. We model the correlation between the execution time and domain-independent runtime features. These features include values of variables, counters of branches, loops, and MPI communications. Through automatically instrumenting an MPI program, each execution of the program will output a feature vector and its corresponding execution time. After collecting data from executions with different inputs, a random forest machine learning approach is used to build an empirical performance model, which can predict the execution time of the program given a new input. A transfer learning method is used to reuse an existing performance model and improve the prediction accuracy on a new platform that lacks historical execution data. 
Our experiments and analyses of three parallel applications, Graph500, GalaxSee, and SMG2000, on three different systems confirm that our method performs well, with less than 20 percent prediction error on average.</abstract><cop>New York</cop><pub>IEEE</pub><doi>10.1109/TC.2020.2964767</doi><tpages>15</tpages><orcidid>https://orcid.org/0000-0001-5098-1503</orcidid><orcidid>https://orcid.org/0000-0002-9961-9051</orcidid></addata></record>
fulltext	fulltext_linktorsrc
identifier	ISSN: 0018-9340
ispartof	IEEE transactions on computers, 2020-05, Vol.69 (5), p.749-763
issn	0018-9340 1557-9956
language	eng
recordid	cdi_crossref_primary_10_1109_TC_2020_2964767
source	IEEE Electronic Library (IEL)
subjects	Algorithms; Automation; Computational modeling; Data collection; Data models; Domains; Empirical analysis; Feature extraction; Instruments; Machine learning; Model accuracy; model transferring; Modelling; Parallel computing; Parallel programming; performance modeling; Performance prediction; Predictive models; Resource management; Run time (computers); Runtime; Task scheduling
title	Automated Performance Modeling of HPC Applications Using Machine Learning
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-21T12%3A08%3A20IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Automated%20Performance%20Modeling%20of%20HPC%20Applications%20Using%20Machine%20Learning&rft.jtitle=IEEE%20transactions%20on%20computers&rft.au=Sun,%20Jingwei&rft.date=2020-05-01&rft.volume=69&rft.issue=5&rft.spage=749&rft.epage=763&rft.pages=749-763&rft.issn=0018-9340&rft.eissn=1557-9956&rft.coden=ITCOB4&rft_id=info:doi/10.1109/TC.2020.2964767&rft_dat=%3Cproquest_RIE%3E2388101282%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2388101282&rft_id=info:pmid/&rft_ieee_id=8956059&rfr_iscdi=true
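
The description field above outlines a pipeline in which each instrumented run yields a feature vector plus its execution time, and a random forest is then fitted to predict the time for a new input. The sketch below illustrates that idea only; it is not the authors' implementation. It is a small pure-Python random forest regressor trained on synthetic data, and every name in it (`build_tree`, `train_forest`, `predict`, the three counters, and the made-up cost formula) is a hypothetical chosen for illustration.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(rows, targets, depth, rng):
    """Grow one regression tree; each split considers a random feature subset."""
    if depth == 0 or len(rows) < 2 or len(set(targets)) == 1:
        return mean(targets)  # leaf: mean execution time of the samples here
    n_features = len(rows[0])
    candidates = rng.sample(range(n_features), max(1, int(n_features ** 0.5)))
    best = None  # (error, feature index, threshold)
    for f in candidates:
        for threshold in sorted({r[f] for r in rows})[1:]:
            left = [t for r, t in zip(rows, targets) if r[f] < threshold]
            right = [t for r, t in zip(rows, targets) if r[f] >= threshold]
            error = sse(left) + sse(right)
            if best is None or error < best[0]:
                best = (error, f, threshold)
    if best is None:  # the sampled features are constant here, so stop
        return mean(targets)
    _, f, threshold = best
    left = [(r, t) for r, t in zip(rows, targets) if r[f] < threshold]
    right = [(r, t) for r, t in zip(rows, targets) if r[f] >= threshold]
    return (
        f,
        threshold,
        build_tree([r for r, _ in left], [t for _, t in left], depth - 1, rng),
        build_tree([r for r, _ in right], [t for _, t in right], depth - 1, rng),
    )


def predict_tree(node, x):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(node, tuple):
        f, threshold, left, right = node
        node = left if x[f] < threshold else right
    return node


def train_forest(rows, targets, n_trees=25, depth=4, seed=0):
    """Fit n_trees trees, each on a bootstrap sample of the runs."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.choices(range(len(rows)), k=len(rows))  # sample with replacement
        forest.append(
            build_tree([rows[i] for i in idx], [targets[i] for i in idx], depth, rng)
        )
    return forest


def predict(forest, x):
    """Average the per-tree predictions."""
    return mean([predict_tree(tree, x) for tree in forest])


# Synthetic stand-in for instrumented runs (assumption: real feature vectors
# would come from the instrumented MPI program, not from this formula).
data_rng = random.Random(1)
features, times = [], []
for _ in range(60):
    loops = data_rng.randint(100, 10_000)      # hypothetical loop counter
    branches = data_rng.randint(10, 1_000)     # hypothetical branch counter
    messages = data_rng.randint(1, 500)        # hypothetical MPI message counter
    features.append((loops, branches, messages))
    times.append(0.002 * loops + 0.01 * messages)  # made-up cost model

forest = train_forest(features, times)
predicted = predict(forest, (5_000, 400, 250))
print(f"predicted execution time: {predicted:.2f} s")
```

Because every leaf stores a mean of observed execution times, a forest like this can only predict values inside the range it was trained on. That limitation gives some intuition for why the abstract stresses collecting data from runs with different inputs and at different scales.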