Improved Sample Complexity for Reward-free Reinforcement Learning under Low-rank MDPs

In reward-free reinforcement learning (RL), an agent explores the environment first without any reward information, in order to achieve certain learning goals afterwards for any given reward. In this paper we focus on reward-free RL under low-rank MDP models, in which both the representation and linear weight vectors are unknown. Although various algorithms have been proposed for reward-free low-rank MDPs, the corresponding sample complexity is still far from being satisfactory. In this work, we first provide the first known sample complexity lower bound that holds for any algorithm under low-rank MDPs. This lower bound implies that it is strictly harder to find a near-optimal policy under low-rank MDPs than under linear MDPs. We then propose a novel model-based algorithm, coined RAFFLE, and show it can both find an $\epsilon$-optimal policy and achieve an $\epsilon$-accurate system identification via reward-free exploration, with a sample complexity significantly improving the previous results. Such a sample complexity matches our lower bound in the dependence on $\epsilon$, as well as on $K$ in the large $d$ regime, where $d$ and $K$ respectively denote the representation dimension and action space cardinality. Finally, we provide a planning algorithm (without further interaction with the true environment) for RAFFLE to learn a near-accurate representation, which is the first known representation learning guarantee under the same setting.
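
For reference, the low-rank MDP model mentioned in the abstract is commonly formalized as follows; this is a minimal sketch using the standard notation of the low-rank MDP literature, and the paper's exact assumptions and normalizations may differ. The transition kernel at each step $h$ admits a rank-$d$ factorization through two unknown maps:

\[
P_h(s' \mid s, a) \;=\; \big\langle \phi_h^*(s, a),\, \mu_h^*(s') \big\rangle,
\qquad
\phi_h^* : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d,
\quad
\mu_h^* : \mathcal{S} \to \mathbb{R}^d,
\]

where $d$ is the representation dimension referenced in the abstract. A linear MDP is the special case in which the representation $\phi^*$ is known to the learner and only $\mu^*$ must be estimated; in the low-rank setting both $\phi^*$ and $\mu^*$ are unknown, which is the source of the additional hardness captured by the paper's lower bound.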

Bibliographic Details
Main Authors: Cheng, Yuan; Huang, Ruiquan; Yang, Jing; Liang, Yingbin
Format: Article
Language: English
Subjects: Computer Science - Learning; Statistics - Machine Learning
DOI: 10.48550/arxiv.2303.10859
Online Access: Full text available via arXiv.org