FRL-FI: Transient Fault Analysis for Federated Reinforcement Learning-Based Navigation Systems

Swarm intelligence is being increasingly deployed in autonomous systems, such as drones and unmanned vehicles. Federated reinforcement learning (FRL), a key swarm intelligence paradigm where agents interact with their own environments and cooperatively learn a consensus policy while preserving priva...

Bibliographic Details
Main authors:	Wan, Zishen; Anwar, Aqeel; Mahmoud, Abdulrahman; Jia, Tianyu; Hsiao, Yu-Shun; Reddi, Vijay Janapa; Raychowdhury, Arijit
Format:	Article
Language:	eng
Subjects:	Computer Science - Hardware Architecture; Computer Science - Learning
Online access:	Order full text

container_end_page
container_issue
container_start_page
container_title
container_volume
creator	Wan, Zishen ; Anwar, Aqeel ; Mahmoud, Abdulrahman ; Jia, Tianyu ; Hsiao, Yu-Shun ; Reddi, Vijay Janapa ; Raychowdhury, Arijit
description	Swarm intelligence is being increasingly deployed in autonomous systems, such as drones and unmanned vehicles. Federated reinforcement learning (FRL), a key swarm intelligence paradigm where agents interact with their own environments and cooperatively learn a consensus policy while preserving privacy, has recently shown potential advantages and gained popularity. However, transient faults are increasing in the hardware system with continuous technology node scaling and can pose threats to FRL systems. Meanwhile, conventional redundancy-based protection methods are challenging to deploy on resource-constrained edge applications. In this paper, we experimentally evaluate the fault tolerance of FRL navigation systems at various scales with respect to fault models, fault locations, learning algorithms, layer types, communication intervals, and data types at both training and inference stages. We further propose two cost-effective fault detection and recovery techniques that can achieve up to 3.3x improvement in resilience with <2.7% overhead in FRL systems.
doi_str_mv	10.48550/arxiv.2203.07276
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2203_07276</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2203_07276</sourcerecordid><originalsourceid>FETCH-LOGICAL-a676-78473ea1c894584e6250bc5cdae0fd8f107ad91e69b7a69b14e7780a2bf5b0343</originalsourceid><addsrcrecordid>eNotj89Kw0AYxPfiQaoP4Ml9gY2b7N94q8VoISi0OTd8Sb6UhWQru7GYtzetXmZgZhj4EfKQ8kRapfgThB93TrKMi4SbzOhbcih2JSu2z7QK4KNDP9ECvoeJrj0Mc3SR9qdAC-wwwIQd3aHzS9LieJmWCME7f2QvEJfyA87uCJM7ebqf44RjvCM3PQwR7_99Raritdq8s_LzbbtZlwy00cxYaQRC2tpcKitRZ4o3rWo7QN53tk-5gS5PUeeNgUVSicZYDlnTq4YLKVbk8e_2Clh_BTdCmOsLaH0FFb-bkk4-</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>FRL-FI: Transient Fault Analysis for Federated Reinforcement Learning-Based Navigation Systems</title><source>arXiv.org</source><creator>Wan, Zishen ; Anwar, Aqeel ; Mahmoud, Abdulrahman ; Jia, Tianyu ; Hsiao, Yu-Shun ; Reddi, Vijay Janapa ; Raychowdhury, Arijit</creator><creatorcontrib>Wan, Zishen ; Anwar, Aqeel ; Mahmoud, Abdulrahman ; Jia, Tianyu ; Hsiao, Yu-Shun ; Reddi, Vijay Janapa ; Raychowdhury, Arijit</creatorcontrib><description>Swarm intelligence is being increasingly deployed in autonomous systems, such as drones and unmanned vehicles. Federated reinforcement learning (FRL), a key swarm intelligence paradigm where agents interact with their own environments and cooperatively learn a consensus policy while preserving privacy, has recently shown potential advantages and gained popularity. However, transient faults are increasing in the hardware system with continuous technology node scaling and can pose threats to FRL systems. Meanwhile, conventional redundancy-based protection methods are challenging to deploy on resource-constrained edge applications. 
In this paper, we experimentally evaluate the fault tolerance of FRL navigation systems at various scales with respect to fault models, fault locations, learning algorithms, layer types, communication intervals, and data types at both training and inference stages. We further propose two cost-effective fault detection and recovery techniques that can achieve up to 3.3x improvement in resilience with <2.7% overhead in FRL systems.</description><identifier>DOI: 10.48550/arxiv.2203.07276</identifier><language>eng</language><subject>Computer Science - Hardware Architecture ; Computer Science - Learning</subject><creationdate>2022-03</creationdate><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,885</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2203.07276$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2203.07276$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Wan, Zishen</creatorcontrib><creatorcontrib>Anwar, Aqeel</creatorcontrib><creatorcontrib>Mahmoud, Abdulrahman</creatorcontrib><creatorcontrib>Jia, Tianyu</creatorcontrib><creatorcontrib>Hsiao, Yu-Shun</creatorcontrib><creatorcontrib>Reddi, Vijay Janapa</creatorcontrib><creatorcontrib>Raychowdhury, Arijit</creatorcontrib><title>FRL-FI: Transient Fault Analysis for Federated Reinforcement Learning-Based Navigation Systems</title><description>Swarm intelligence is being increasingly deployed in autonomous systems, such as drones and unmanned vehicles. 
Federated reinforcement learning (FRL), a key swarm intelligence paradigm where agents interact with their own environments and cooperatively learn a consensus policy while preserving privacy, has recently shown potential advantages and gained popularity. However, transient faults are increasing in the hardware system with continuous technology node scaling and can pose threats to FRL systems. Meanwhile, conventional redundancy-based protection methods are challenging to deploy on resource-constrained edge applications. In this paper, we experimentally evaluate the fault tolerance of FRL navigation systems at various scales with respect to fault models, fault locations, learning algorithms, layer types, communication intervals, and data types at both training and inference stages. We further propose two cost-effective fault detection and recovery techniques that can achieve up to 3.3x improvement in resilience with <2.7% overhead in FRL systems.</description><subject>Computer Science - Hardware Architecture</subject><subject>Computer Science - Learning</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotj89Kw0AYxPfiQaoP4Ml9gY2b7N94q8VoISi0OTd8Sb6UhWQru7GYtzetXmZgZhj4EfKQ8kRapfgThB93TrKMi4SbzOhbcih2JSu2z7QK4KNDP9ECvoeJrj0Mc3SR9qdAC-wwwIQd3aHzS9LieJmWCME7f2QvEJfyA87uCJM7ebqf44RjvCM3PQwR7_99Raritdq8s_LzbbtZlwy00cxYaQRC2tpcKitRZ4o3rWo7QN53tk-5gS5PUeeNgUVSicZYDlnTq4YLKVbk8e_2Clh_BTdCmOsLaH0FFb-bkk4-</recordid><startdate>20220314</startdate><enddate>20220314</enddate><creator>Wan, Zishen</creator><creator>Anwar, Aqeel</creator><creator>Mahmoud, Abdulrahman</creator><creator>Jia, Tianyu</creator><creator>Hsiao, Yu-Shun</creator><creator>Reddi, Vijay Janapa</creator><creator>Raychowdhury, Arijit</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20220314</creationdate><title>FRL-FI: Transient Fault Analysis for Federated Reinforcement 
Learning-Based Navigation Systems</title><author>Wan, Zishen ; Anwar, Aqeel ; Mahmoud, Abdulrahman ; Jia, Tianyu ; Hsiao, Yu-Shun ; Reddi, Vijay Janapa ; Raychowdhury, Arijit</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a676-78473ea1c894584e6250bc5cdae0fd8f107ad91e69b7a69b14e7780a2bf5b0343</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Computer Science - Hardware Architecture</topic><topic>Computer Science - Learning</topic><toplevel>online_resources</toplevel><creatorcontrib>Wan, Zishen</creatorcontrib><creatorcontrib>Anwar, Aqeel</creatorcontrib><creatorcontrib>Mahmoud, Abdulrahman</creatorcontrib><creatorcontrib>Jia, Tianyu</creatorcontrib><creatorcontrib>Hsiao, Yu-Shun</creatorcontrib><creatorcontrib>Reddi, Vijay Janapa</creatorcontrib><creatorcontrib>Raychowdhury, Arijit</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Wan, Zishen</au><au>Anwar, Aqeel</au><au>Mahmoud, Abdulrahman</au><au>Jia, Tianyu</au><au>Hsiao, Yu-Shun</au><au>Reddi, Vijay Janapa</au><au>Raychowdhury, Arijit</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>FRL-FI: Transient Fault Analysis for Federated Reinforcement Learning-Based Navigation Systems</atitle><date>2022-03-14</date><risdate>2022</risdate><abstract>Swarm intelligence is being increasingly deployed in autonomous systems, such as drones and unmanned vehicles. Federated reinforcement learning (FRL), a key swarm intelligence paradigm where agents interact with their own environments and cooperatively learn a consensus policy while preserving privacy, has recently shown potential advantages and gained popularity. 
However, transient faults are increasing in the hardware system with continuous technology node scaling and can pose threats to FRL systems. Meanwhile, conventional redundancy-based protection methods are challenging to deploy on resource-constrained edge applications. In this paper, we experimentally evaluate the fault tolerance of FRL navigation systems at various scales with respect to fault models, fault locations, learning algorithms, layer types, communication intervals, and data types at both training and inference stages. We further propose two cost-effective fault detection and recovery techniques that can achieve up to 3.3x improvement in resilience with <2.7% overhead in FRL systems.</abstract><doi>10.48550/arxiv.2203.07276</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2203.07276
ispartof
issn
language	eng
recordid	cdi_arxiv_primary_2203_07276
source	arXiv.org
subjects	Computer Science - Hardware Architecture ; Computer Science - Learning
title	FRL-FI: Transient Fault Analysis for Federated Reinforcement Learning-Based Navigation Systems
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-05T08%3A27%3A17IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=FRL-FI:%20Transient%20Fault%20Analysis%20for%20Federated%20Reinforcement%20Learning-Based%20Navigation%20Systems&rft.au=Wan,%20Zishen&rft.date=2022-03-14&rft_id=info:doi/10.48550/arxiv.2203.07276&rft_dat=%3Carxiv_GOX%3E2203_07276%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true