Absolute Policy Optimization

In recent years, trust-region on-policy reinforcement learning has achieved impressive results on complex control tasks and gaming scenarios. However, contemporary state-of-the-art algorithms in this category primarily emphasize improvement in expected performance and lack the ability to control worst-case performance outcomes. To address this limitation, we introduce a novel objective function whose optimization guarantees monotonic improvement in a lower probability bound of performance with high confidence. Building upon this theoretical advancement, we further introduce a practical solution called Absolute Policy Optimization (APO). Our experiments demonstrate the effectiveness of our approach on challenging continuous-control benchmark tasks and extend its applicability to mastering Atari games. Our findings reveal that APO, as well as its efficient variant Proximal Absolute Policy Optimization (PAPO), significantly outperforms state-of-the-art policy gradient algorithms, yielding substantial improvements in worst-case as well as expected performance.
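The record does not state the paper's objective explicitly. As a rough, illustrative sketch only (not the exact formulation from the paper), a high-confidence lower bound on performance can be written as an expected return penalized by the return's standard deviation, and an APO-style update would maximize such a bound within a trust region. All symbols below are assumptions introduced for illustration:

```latex
% Illustrative sketch only -- not necessarily the paper's exact objective.
% J(\pi)        : expected return of policy \pi
% \sigma^2(\pi) : variance of the return under \pi
% k             : confidence parameter (larger k = a more conservative bound)
% \delta        : trust-region size, as in TRPO-style updates
\mathcal{B}_k(\pi) = J(\pi) - k\,\sqrt{\sigma^2(\pi)}, \qquad
\pi_{\text{new}} = \arg\max_{\pi}\ \mathcal{B}_k(\pi)
\quad \text{s.t.} \quad
\mathbb{E}_{s \sim \pi_{\text{old}}}\!\left[ D_{\mathrm{KL}}\!\left(\pi_{\text{old}}(\cdot \mid s) \,\|\, \pi(\cdot \mid s)\right) \right] \le \delta .
```

The PAPO variant mentioned in the abstract presumably replaces the explicit trust-region constraint with a PPO-style clipped surrogate of the same bound, which is the usual way such constraints are made cheap to optimize; this, too, is an inference rather than a detail given in this record.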


Saved in:
Bibliographic Details
Main Authors: Zhao, Weiye; Li, Feihan; Sun, Yifan; Chen, Rui; Wei, Tianhao; Liu, Changliu
Format: Article
Language: English
Subjects: Computer Science - Artificial Intelligence; Computer Science - Learning; Computer Science - Robotics
Online Access: Order full text
creator Zhao, Weiye
Li, Feihan
Sun, Yifan
Chen, Rui
Wei, Tianhao
Liu, Changliu
description In recent years, trust-region on-policy reinforcement learning has achieved impressive results on complex control tasks and gaming scenarios. However, contemporary state-of-the-art algorithms in this category primarily emphasize improvement in expected performance and lack the ability to control worst-case performance outcomes. To address this limitation, we introduce a novel objective function whose optimization guarantees monotonic improvement in a lower probability bound of performance with high confidence. Building upon this theoretical advancement, we further introduce a practical solution called Absolute Policy Optimization (APO). Our experiments demonstrate the effectiveness of our approach on challenging continuous-control benchmark tasks and extend its applicability to mastering Atari games. Our findings reveal that APO, as well as its efficient variant Proximal Absolute Policy Optimization (PAPO), significantly outperforms state-of-the-art policy gradient algorithms, yielding substantial improvements in worst-case as well as expected performance.
doi_str_mv 10.48550/arxiv.2310.13230
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2310.13230
language eng
recordid cdi_arxiv_primary_2310_13230
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Learning
Computer Science - Robotics
title Absolute Policy Optimization
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-20T05%3A05%3A45IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Absolute%20Policy%20Optimization&rft.au=Zhao,%20Weiye&rft.date=2023-10-19&rft_id=info:doi/10.48550/arxiv.2310.13230&rft_dat=%3Carxiv_GOX%3E2310_13230%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true