A Machine Learning Approach to Online Fault Classification in HPC Systems

As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at the hardware and software levels will increase significantly. Thus, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failu...

Detailed description

Bibliographic details
Published in:	arXiv.org 2020-07
Main authors:	Netti, Alessio ; Kiziltan, Zeynep ; Babaoglu, Ozalp ; Sirbu, Alina ; Bartolini, Andrea ; Borghesi, Andrea
Format:	Article
Language:	eng
Subjects:	Classification ; Computer Science - Distributed, Parallel, and Cluster Computing ; Computer Science - Learning ; Datasets ; Failure rates ; Fault detection ; Machine learning
Online access:	Full text

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Netti, Alessio ; Kiziltan, Zeynep ; Babaoglu, Ozalp ; Sirbu, Alina ; Bartolini, Andrea ; Borghesi, Andrea
description	As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at the hardware and software levels will increase significantly. Thus, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures becomes essential for continued operation. Central to this objective is fault injection, which is the deliberate triggering of faults in a system so as to observe their behavior in a controlled environment. In this paper, we propose a fault classification method for HPC systems based on machine learning. The novelty of our approach rests with the fact that it can be operated on streamed data in an online manner, thus opening the possibility to devise and enact control actions on the target system in real-time. We introduce a high-level, easy-to-use fault injection tool called FINJ, with a focus on the management of complex experiments. In order to train and evaluate our machine learning classifiers, we inject faults to an in-house experimental HPC system using FINJ, and generate a fault dataset which we describe extensively. Both FINJ and the dataset are publicly available to facilitate resiliency research in the HPC systems field. Experimental results demonstrate that our approach allows almost perfect classification accuracy to be reached for different fault types with low computational overhead and minimal delay.
doi_str_mv	10.48550/arxiv.2007.14241
format	Article
fullrecord	<record><control><sourceid>proquest_arxiv</sourceid><recordid>TN_cdi_arxiv_primary_2007_14241</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2428261833</sourcerecordid><originalsourceid>FETCH-LOGICAL-a523-1ca159679c0e103ecd211d2d75fc550608b90ba1e834c9e9cd4954d5c84606fa3</originalsourceid><addsrcrecordid>eNotj99LwzAcxIMgOOb-AJ8M-NyafPOj6WMpzg0qE9x7ydJUM7q0Jp24_95u8-ng7jjug9ADJSlXQpBnHX7dTwqEZCnlwOkNmgFjNFEc4A4tYtwTQkBmIASboXWB37T5ct7iyurgnf_ExTCEfjLx2OON787ZUh-7EZedjtG1zujR9R47j1fvJf44xdEe4j26bXUX7eJf52i7fNmWq6TavK7Lokq0AJZQo6nIZZYbYilh1jRAaQNNJlozvZdE7XKy09Qqxk1uc9PwXPBGGMUlka1mc_R4nb1w1kNwBx1O9Zm3vvBOjadrY6L4Pto41vv-GPz0qQYOCiRVjLE_YzpWkw</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2428261833</pqid></control><display><type>article</type><title>A Machine Learning Approach to Online Fault Classification in HPC Systems</title><source>arXiv.org</source><source>Free E- Journals</source><creator>Netti, Alessio ; Kiziltan, Zeynep ; Babaoglu, Ozalp ; Sirbu, Alina ; Bartolini, Andrea ; Borghesi, Andrea</creator><creatorcontrib>Netti, Alessio ; Kiziltan, Zeynep ; Babaoglu, Ozalp ; Sirbu, Alina ; Bartolini, Andrea ; Borghesi, Andrea</creatorcontrib><description>As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at the hardware and software levels will increase significantly. Thus, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures becomes essential for continued operation. Central to this objective is fault injection, which is the deliberate triggering of faults in a system so as to observe their behavior in a controlled environment. In this paper, we propose a fault classification method for HPC systems based on machine learning. 
The novelty of our approach rests with the fact that it can be operated on streamed data in an online manner, thus opening the possibility to devise and enact control actions on the target system in real-time. We introduce a high-level, easy-to-use fault injection tool called FINJ, with a focus on the management of complex experiments. In order to train and evaluate our machine learning classifiers, we inject faults to an in-house experimental HPC system using FINJ, and generate a fault dataset which we describe extensively. Both FINJ and the dataset are publicly available to facilitate resiliency research in the HPC systems field. Experimental results demonstrate that our approach allows almost perfect classification accuracy to be reached for different fault types with low computational overhead and minimal delay.</description><identifier>EISSN: 2331-8422</identifier><identifier>DOI: 10.48550/arxiv.2007.14241</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Classification ; Computer Science - Distributed, Parallel, and Cluster Computing ; Computer Science - Learning ; Datasets ; Failure rates ; Fault detection ; Machine learning</subject><ispartof>arXiv.org, 2020-07</ispartof><rights>2020. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,784,885,27925</link.rule.ids><backlink>$$Uhttps://doi.org/10.1016/j.future.2019.11.029$$DView published paper (Access to full text may be restricted)$$Hfree_for_read</backlink><backlink>$$Uhttps://doi.org/10.48550/arXiv.2007.14241$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Netti, Alessio</creatorcontrib><creatorcontrib>Kiziltan, Zeynep</creatorcontrib><creatorcontrib>Babaoglu, Ozalp</creatorcontrib><creatorcontrib>Sirbu, Alina</creatorcontrib><creatorcontrib>Bartolini, Andrea</creatorcontrib><creatorcontrib>Borghesi, Andrea</creatorcontrib><title>A Machine Learning Approach to Online Fault Classification in HPC Systems</title><title>arXiv.org</title><description>As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at the hardware and software levels will increase significantly. Thus, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures becomes essential for continued operation. Central to this objective is fault injection, which is the deliberate triggering of faults in a system so as to observe their behavior in a controlled environment. In this paper, we propose a fault classification method for HPC systems based on machine learning. The novelty of our approach rests with the fact that it can be operated on streamed data in an online manner, thus opening the possibility to devise and enact control actions on the target system in real-time. 
We introduce a high-level, easy-to-use fault injection tool called FINJ, with a focus on the management of complex experiments. In order to train and evaluate our machine learning classifiers, we inject faults to an in-house experimental HPC system using FINJ, and generate a fault dataset which we describe extensively. Both FINJ and the dataset are publicly available to facilitate resiliency research in the HPC systems field. Experimental results demonstrate that our approach allows almost perfect classification accuracy to be reached for different fault types with low computational overhead and minimal delay.</description><subject>Classification</subject><subject>Computer Science - Distributed, Parallel, and Cluster Computing</subject><subject>Computer Science - Learning</subject><subject>Datasets</subject><subject>Failure rates</subject><subject>Fault detection</subject><subject>Machine learning</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GOX</sourceid><recordid>eNotj99LwzAcxIMgOOb-AJ8M-NyafPOj6WMpzg0qE9x7ydJUM7q0Jp24_95u8-ng7jjug9ADJSlXQpBnHX7dTwqEZCnlwOkNmgFjNFEc4A4tYtwTQkBmIASboXWB37T5ct7iyurgnf_ExTCEfjLx2OON787ZUh-7EZedjtG1zujR9R47j1fvJf44xdEe4j26bXUX7eJf52i7fNmWq6TavK7Lokq0AJZQo6nIZZYbYilh1jRAaQNNJlozvZdE7XKy09Qqxk1uc9PwXPBGGMUlka1mc_R4nb1w1kNwBx1O9Zm3vvBOjadrY6L4Pto41vv-GPz0qQYOCiRVjLE_YzpWkw</recordid><startdate>20200727</startdate><enddate>20200727</enddate><creator>Netti, Alessio</creator><creator>Kiziltan, Zeynep</creator><creator>Babaoglu, Ozalp</creator><creator>Sirbu, Alina</creator><creator>Bartolini, Andrea</creator><creator>Borghesi, Andrea</creator><general>Cornell University Library, 
arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20200727</creationdate><title>A Machine Learning Approach to Online Fault Classification in HPC Systems</title><author>Netti, Alessio ; Kiziltan, Zeynep ; Babaoglu, Ozalp ; Sirbu, Alina ; Bartolini, Andrea ; Borghesi, Andrea</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a523-1ca159679c0e103ecd211d2d75fc550608b90ba1e834c9e9cd4954d5c84606fa3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Classification</topic><topic>Computer Science - Distributed, Parallel, and Cluster Computing</topic><topic>Computer Science - Learning</topic><topic>Datasets</topic><topic>Failure rates</topic><topic>Fault detection</topic><topic>Machine learning</topic><toplevel>online_resources</toplevel><creatorcontrib>Netti, Alessio</creatorcontrib><creatorcontrib>Kiziltan, Zeynep</creatorcontrib><creatorcontrib>Babaoglu, Ozalp</creatorcontrib><creatorcontrib>Sirbu, Alina</creatorcontrib><creatorcontrib>Bartolini, Andrea</creatorcontrib><creatorcontrib>Borghesi, Andrea</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One 
Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Access via ProQuest (Open Access)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection><collection>arXiv Computer Science</collection><collection>arXiv.org</collection><jtitle>arXiv.org</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Netti, Alessio</au><au>Kiziltan, Zeynep</au><au>Babaoglu, Ozalp</au><au>Sirbu, Alina</au><au>Bartolini, Andrea</au><au>Borghesi, Andrea</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>A Machine Learning Approach to Online Fault Classification in HPC Systems</atitle><jtitle>arXiv.org</jtitle><date>2020-07-27</date><risdate>2020</risdate><eissn>2331-8422</eissn><abstract>As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at the hardware and software levels will increase significantly. Thus, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures becomes essential for continued operation. Central to this objective is fault injection, which is the deliberate triggering of faults in a system so as to observe their behavior in a controlled environment. In this paper, we propose a fault classification method for HPC systems based on machine learning. 
The novelty of our approach rests with the fact that it can be operated on streamed data in an online manner, thus opening the possibility to devise and enact control actions on the target system in real-time. We introduce a high-level, easy-to-use fault injection tool called FINJ, with a focus on the management of complex experiments. In order to train and evaluate our machine learning classifiers, we inject faults to an in-house experimental HPC system using FINJ, and generate a fault dataset which we describe extensively. Both FINJ and the dataset are publicly available to facilitate resiliency research in the HPC systems field. Experimental results demonstrate that our approach allows almost perfect classification accuracy to be reached for different fault types with low computational overhead and minimal delay.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><doi>10.48550/arxiv.2007.14241</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2020-07
issn	2331-8422
language	eng
recordid	cdi_arxiv_primary_2007_14241
source	arXiv.org; Free E- Journals
subjects	Classification ; Computer Science - Distributed, Parallel, and Cluster Computing ; Computer Science - Learning ; Datasets ; Failure rates ; Fault detection ; Machine learning
title	A Machine Learning Approach to Online Fault Classification in HPC Systems
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-20T05%3A20%3A58IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_arxiv&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20Machine%20Learning%20Approach%20to%20Online%20Fault%20Classification%20in%20HPC%20Systems&rft.jtitle=arXiv.org&rft.au=Netti,%20Alessio&rft.date=2020-07-27&rft.eissn=2331-8422&rft_id=info:doi/10.48550/arxiv.2007.14241&rft_dat=%3Cproquest_arxiv%3E2428261833%3C/proquest_arxiv%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2428261833&rft_id=info:pmid/&rfr_iscdi=true