I'm sorry Dave, I'm afraid I can't do that, Deep Q-learning from forbidden action

Detailed Description

Bibliographic Details
Published in: arXiv.org 2020-08
Main authors: Seurin, Mathieu; Preux, Philippe; Pietquin, Olivier
Format: Article
Language: English
Subjects: Algorithms; Computer simulation; Contingency; Industrial robots; Machine learning; Masks; Safety
Online access: Full text
container_title arXiv.org
creator Seurin, Mathieu ; Preux, Philippe ; Pietquin, Olivier
description The use of Reinforcement Learning (RL) is still restricted to simulation or to enhancing human-operated systems through recommendations. Real-world environments (e.g. industrial robots or power grids) are generally designed with safety constraints in mind, implemented in the form of valid action masks or contingency controllers. For example, the range of motion and the angles of a robot's motors can be limited to physical boundaries. Violating constraints thus results in rejected actions or in entering a safe mode driven by an external controller, making RL agents incapable of learning from their mistakes. In this paper, we propose a simple modification of a state-of-the-art deep RL algorithm (DQN), enabling learning from forbidden actions. To do so, the standard Q-learning update is enhanced with an extra safety loss inspired by structured classification. We empirically show that it reduces the number of constraints hit during the learning phase and accelerates convergence to near-optimal policies compared to standard DQN. Experiments are done on a visual grid-world environment and a TextWorld domain.
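The abstract describes augmenting the standard DQN update with an extra, margin-based safety loss inspired by structured classification, so that the Q-values of actions the environment rejects are pushed below those of valid actions. The sketch below is an illustrative reconstruction of that idea in PyTorch-style Python, not the authors' implementation; the function name, the batch layout (including the boolean `forbidden` mask), and the `margin` and `safety_weight` hyperparameters are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def dqn_with_safety_loss(q_net, target_net, batch, gamma=0.99, margin=1.0, safety_weight=1.0):
    """Standard one-step DQN TD loss plus a margin-based safety loss on forbidden actions.

    `batch` is assumed to hold tensors: "states", "actions" (int64), "rewards",
    "next_states", "dones" (float 0/1), and a boolean mask "forbidden" of shape
    (batch_size, n_actions) marking actions the environment rejected.
    All names here are illustrative, not taken from the paper's code.
    """
    q_values = q_net(batch["states"])                                   # (B, n_actions)
    q_taken = q_values.gather(1, batch["actions"].unsqueeze(1)).squeeze(1)

    # Standard TD target computed with the frozen target network.
    with torch.no_grad():
        next_q = target_net(batch["next_states"]).max(dim=1).values
        td_target = batch["rewards"] + gamma * (1.0 - batch["dones"]) * next_q
    td_loss = F.smooth_l1_loss(q_taken, td_target)

    # Safety loss (structured-classification flavour): penalize any forbidden action
    # whose Q-value is not at least `margin` below the best valid action's Q-value.
    valid_q = q_values.masked_fill(batch["forbidden"], float("-inf")).max(dim=1, keepdim=True).values
    violation = q_values - (valid_q - margin)                           # > 0 when a forbidden Q is too high
    safety_loss = (violation.clamp(min=0.0) * batch["forbidden"].float()).sum(dim=1).mean()

    return td_loss + safety_weight * safety_loss
```

Under this reading, the agent still receives the ordinary TD signal when its action is accepted, while rejected actions contribute a gradient even though the environment state does not change, which is how learning from forbidden actions becomes possible.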
format Article
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2020-08
issn 2331-8422
language eng
recordid cdi_proquest_journals_2302243196
source Free E-Journals
subjects Algorithms
Computer simulation
Contingency
Industrial robots
Machine learning
Masks
Safety
title I'm sorry Dave, I'm afraid I can't do that, Deep Q-learning from forbidden action
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-16T16%3A43%3A27IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=I'm%20sorry%20Dave,%20I'm%20afraid%20I%20can't%20do%20that,%20Deep%20Q-learning%20from%20forbidden%20action&rft.jtitle=arXiv.org&rft.au=Seurin,%20Mathieu&rft.date=2020-08-13&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2302243196%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2302243196&rft_id=info:pmid/&rfr_iscdi=true