Exploring and Exploiting Conditioning of Reinforcement Learning Agents

The outcome of Jacobian singular value regularization has been studied for supervised learning problems. In supervised learning settings, for both linear and nonlinear networks, Jacobian regularization allows for faster learning. It was also shown that Jacobian conditioning regularization can help to avoid the "mode-collapse" problem in Generative Adversarial Networks. In this paper, we try to answer the following question: can information about the conditioning of a policy network's Jacobian help to shape a more stable and general policy for reinforcement learning agents? To answer this question, we conduct a study of Jacobian conditioning behavior during policy optimization. We analyze the agent's conditioning on different policies under different sets of hyperparameters and study the correspondence between the conditioning and the ratio of achieved rewards. Based on these observations, we propose a conditioning regularization technique. We apply it to the Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) algorithms and compare their performance on 8 continuous control tasks. Models with the proposed regularization outperform the other models on most of the tasks. We also show that the regularization improves the agent's generalization by comparing PPO performance on CoinRun environments. In addition, we propose an algorithm that uses the condition number of the agent to form a robust policy, which we call Jacobian Policy Optimization (JPO). It directly estimates the condition number of the agent's Jacobian and changes the policy trend. We compare it with PPO on several continuous control tasks in PyBullet environments, and the proposed algorithm provides more stable and efficient reward growth across a range of agents.
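As a rough illustration of the conditioning regularization idea described in the abstract, the sketch below estimates the condition number (ratio of largest to smallest singular value) of a policy network's input-output Jacobian and adds it as a penalty to a policy-gradient loss. This is not the authors' implementation: the network architecture, observation batch, stand-in PPO loss, and penalty weight are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): penalize the condition
# number of the policy network's input-output Jacobian during training.
import torch
import torch.nn as nn

# Hypothetical policy network: 8-dimensional observations -> 2 action means.
policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))

def jacobian_condition_number(net: nn.Module, obs: torch.Tensor) -> torch.Tensor:
    """Average condition number (sigma_max / sigma_min) of the Jacobian of the
    network outputs with respect to each observation in the batch."""
    conds = []
    for x in obs:
        # Jacobian for one observation, shape (action_dim, obs_dim).
        jac = torch.autograd.functional.jacobian(net, x, create_graph=True)
        svals = torch.linalg.svdvals(jac)
        conds.append(svals.max() / (svals.min() + 1e-8))
    return torch.stack(conds).mean()

obs_batch = torch.randn(16, 8)        # hypothetical batch of observations
cond = jacobian_condition_number(policy, obs_batch)
ppo_loss = torch.tensor(0.0)          # stand-in for the usual PPO surrogate loss
loss = ppo_loss + 1e-3 * cond         # conditioning penalty with an assumed weight
loss.backward()                       # gradients flow into the policy parameters
```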

Detailed Description

Saved in:
Bibliographic Details
Published in: IEEE Access, 2020, Vol. 8, p. 211951-211960
Main Authors: Asadulaev, Arip; Kuznetsov, Igor; Stein, Gideon; Filchenkov, Andrey
Format: Article
Language: English
Subjects: Algorithms; Approximation algorithms; Computer architecture; conditioning; Control tasks; generalization; Jacobian matrices; Machine learning; neural networks; Optimization; policy optimization; Questions; Regularization; Reinforcement learning; Shape; Supervised learning; Task analysis
Online Access: Full text
DOI: 10.1109/ACCESS.2020.3037276
ISSN: 2169-3536