Improved Sample Complexity for Reward-free Reinforcement Learning under Low-rank MDPs

In reward-free reinforcement learning (RL), an agent explores the environment first without any reward information, in order to achieve certain learning goals afterwards for any given reward. In this paper we focus on reward-free RL under low-rank MDP models, in which both the representation and linear weight vectors are unknown. Although various algorithms have been proposed for reward-free low-rank MDPs, the corresponding sample complexity is still far from being satisfactory. In this work, we first provide the first known sample complexity lower bound that holds for any algorithm under low-rank MDPs. This lower bound implies that it is strictly harder to find a near-optimal policy under low-rank MDPs than under linear MDPs. We then propose a novel model-based algorithm, coined RAFFLE, and show it can both find an $\epsilon$-optimal policy and achieve an $\epsilon$-accurate system identification via reward-free exploration, with a sample complexity significantly improving the previous results. Such a sample complexity matches our lower bound in the dependence on $\epsilon$, as well as on $K$ in the large $d$ regime, where $d$ and $K$ respectively denote the representation dimension and action space cardinality. Finally, we provide a planning algorithm (without further interaction with the true environment) for RAFFLE to learn a near-accurate representation, which is the first known representation learning guarantee under the same setting.
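
For reference, the low-rank MDP model mentioned in the abstract is commonly formalized as follows; this is a minimal sketch using the standard notation of the low-rank MDP literature, and the paper's exact assumptions and normalizations may differ. The transition kernel at each step $h$ admits a rank-$d$ factorization through two unknown maps:

\[
P_h(s' \mid s, a) \;=\; \big\langle \phi_h^*(s, a),\, \mu_h^*(s') \big\rangle,
\qquad
\phi_h^* : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d,
\quad
\mu_h^* : \mathcal{S} \to \mathbb{R}^d,
\]

where $d$ is the representation dimension referenced in the abstract. A linear MDP is the special case in which the representation $\phi^*$ is known to the learner and only $\mu^*$ must be estimated; in the low-rank setting both $\phi^*$ and $\mu^*$ are unknown, which is the source of the additional hardness captured by the paper's lower bound.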

Bibliographic Details
Main Authors: Cheng, Yuan; Huang, Ruiquan; Yang, Jing; Liang, Yingbin
Format: Article
Language: English
Subjects: Computer Science - Learning; Statistics - Machine Learning
DOI: 10.48550/arxiv.2303.10859
Online Access: Full text available via arXiv.org