Reinforcement Learning From Imperfect Corrective Actions And Proxy Rewards

In practice, reinforcement learning (RL) agents are often trained with a possibly imperfect proxy reward function, which may lead to a human-agent alignment issue (i.e., the learned policy either converges to non-optimal performance with low cumulative rewards, or achieves high cumulative rewards but in an undesired manner). To tackle this issue, we consider a framework where a human labeler can provide additional feedback in the form of corrective actions, which express the labeler's action preferences, although this feedback may be imperfect as well. In this setting, to obtain a better-aligned policy guided by both learning signals, we propose a novel value-based deep RL algorithm called Iterative learning from Corrective actions and Proxy rewards (ICoPro), which cycles through three phases: (1) solicit sparse corrective actions from a human labeler on the agent's demonstrated trajectories; (2) incorporate these corrective actions into the Q-function using a margin loss to enforce adherence to the labeler's preferences; (3) train the agent with standard RL losses regularized by a margin loss to learn from proxy rewards and propagate the Q-values learned from human feedback. Moreover, another novel design in our approach is to integrate pseudo-labels from the target Q-network to reduce human labor and further stabilize training. We experimentally validate our approach on a variety of tasks (Atari games and autonomous driving on highway). On the one hand, using proxy rewards with different levels of imperfection, our method aligns better with human preferences and is more sample-efficient than baseline methods. On the other hand, facing corrective actions with different types of imperfection, our method can overcome the non-optimality of this feedback thanks to the guidance from the proxy reward.
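The abstract describes two learning signals that can be sketched concretely: a standard TD loss on the proxy reward, and a large-margin loss that pushes the Q-value of the labeler's corrective action above every other action, with pseudo-labels taken from the target Q-network on unlabeled states. The sketch below is an illustration based only on the abstract, not the authors' released code; the margin value, the loss weight `lam`, the batch layout, and the confidence rule for pseudo-labels are all assumptions.

```python
import torch
import torch.nn.functional as F


def margin_loss(q_values, labeled_actions, margin=0.8):
    """Large-margin loss: the Q-value of the labeled (corrective or pseudo-labeled)
    action must exceed every other action's Q-value by at least `margin`.

    q_values:        (batch, num_actions) Q(s, .) from the online network
    labeled_actions: (batch,) indices of the labeled actions
    """
    margins = torch.full_like(q_values, margin)          # l(a_E, a) = margin if a != a_E
    margins.scatter_(1, labeled_actions.unsqueeze(1), 0)  # l(a_E, a_E) = 0
    q_labeled = q_values.gather(1, labeled_actions.unsqueeze(1)).squeeze(1)
    # max_a [Q(s,a) + l(a_E,a)] - Q(s,a_E), non-negative by construction
    return ((q_values + margins).max(dim=1).values - q_labeled).mean()


def pseudo_labels(target_net, states, tau=0.0):
    """Pseudo-label states with the target network's greedy action, keeping only
    states where its advantage over the runner-up exceeds tau (assumed rule)."""
    with torch.no_grad():
        q = target_net(states)
        top2 = q.topk(2, dim=1).values
        keep = (top2[:, 0] - top2[:, 1]) > tau
    return q.argmax(dim=1), keep


def combined_loss(q_net, target_net, batch, gamma=0.99, margin=0.8, lam=1.0):
    """One-step TD loss on proxy rewards, regularized by the margin loss on
    transitions that carry a corrective action (or pseudo-label)."""
    # Assumed batch layout: `done` is a float mask, `has_corr` a boolean mask.
    s, a, r, s_next, done, corr_a, has_corr = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        td_target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    td_loss = F.smooth_l1_loss(q_sa, td_target)

    if has_corr.any():
        m_loss = margin_loss(q_net(s[has_corr]), corr_a[has_corr], margin)
    else:
        m_loss = torch.zeros((), device=s.device)
    return td_loss + lam * m_loss
```

Combining the two terms in one objective is what lets the corrective actions shape the Q-function while bootstrapping on the proxy reward propagates those values to unlabeled states; how ICoPro schedules the three phases and weights the terms is not specified by the abstract alone.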

Bibliographic Details

Published in: arXiv.org, 2024-10
Main authors: Jiang, Zhaohui; Feng, Xuening; Weng, Paul; Zhu, Yifei; Song, Yan; Zhou, Tianze; Hu, Yujing; Lv, Tangjie; Fan, Changjie
Format: Article
Language: English
EISSN: 2331-8422
Subjects: Algorithms; Defects; Design standards; Feedback; Machine learning; Optimization
Online access: Full text