Exploration Bonus for Regret Minimization in Undiscounted Discrete and Continuous Markov Decision Processes
creator | Qian, Jian; Fruit, Ronan; Pirotta, Matteo; Lazaric, Alessandro |
description | We introduce and analyse two algorithms for exploration-exploitation in discrete and continuous Markov Decision Processes (MDPs) based on exploration bonuses. SCAL$^+$ is a variant of SCAL (Fruit et al., 2018) that performs efficient exploration-exploitation in any unknown weakly-communicating MDP for which an upper bound $C$ on the span of the optimal bias function is known. For an MDP with $S$ states, $A$ actions and $\Gamma \leq S$ possible next states, we prove that SCAL$^+$ achieves the same theoretical guarantees as SCAL (i.e., a high-probability regret bound of $\widetilde{O}(C\sqrt{\Gamma SAT})$) at a much lower computational cost. Similarly, C-SCAL$^+$ exploits an exploration bonus to achieve sublinear regret in any undiscounted MDP with continuous state space. We show that C-SCAL$^+$ achieves the same regret bound as UCCRL (Ortner and Ryabko, 2012) while being the first implementable algorithm with regret guarantees in this setting. While optimistic algorithms such as UCRL, SCAL or UCCRL maintain a high-confidence set of plausible MDPs around the true unknown MDP, SCAL$^+$ and C-SCAL$^+$ leverage an exploration bonus to plan directly on the empirically estimated MDP, and are therefore more computationally efficient. |
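To make that last point concrete, below is a minimal sketch of the "bonus plus planning on the empirical MDP" pattern the abstract describes: relative value iteration on the estimated model, with an optimism bonus of order $C\sqrt{1/N(s,a)}$ added to the empirical rewards, and a truncation step that keeps the span of the value function below $C$. This is an illustrative simplification under stated assumptions, not the authors' SCAL$^+$ implementation; the function name, the bonus constant, and the fixed iteration count are all hypothetical.

```python
import numpy as np

def bonus_value_iteration(P_hat, R_hat, N, C, delta=0.05, n_iter=200):
    """Illustrative exploration-bonus planning on an empirical MDP.

    P_hat: (S, A, S) empirical transition probabilities
    R_hat: (S, A)    empirical mean rewards
    N:     (S, A)    visit counts
    C:     known upper bound on the span of the optimal bias function
    """
    S, A, _ = P_hat.shape
    # Optimism enters only through a per state-action reward bonus that
    # shrinks as 1/sqrt(N); the log term mimics a union bound over (s, a).
    bonus = C * np.sqrt(np.log(2.0 * S * A / delta) / np.maximum(N, 1))
    v = np.zeros(S)
    for _ in range(n_iter):
        q = R_hat + bonus + P_hat @ v   # Bellman backup on the empirical MDP
        v_new = q.max(axis=1)
        v_new -= v_new.min()            # relative VI: remove the constant offset
        v = np.minimum(v_new, C)        # truncation: enforce span(v) <= C
    policy = (R_hat + bonus + P_hat @ v).argmax(axis=1)
    return v, policy

# Toy usage on a randomly generated 2-state, 2-action empirical model.
rng = np.random.default_rng(0)
counts = rng.integers(1, 50, size=(2, 2, 2)).astype(float)
P_hat = counts / counts.sum(axis=2, keepdims=True)
N = counts.sum(axis=2)
v, pi = bonus_value_iteration(P_hat, rng.random((2, 2)), N, C=2.0)
```

The span truncation is what replaces the confidence-set optimization of UCRL-style methods: since planning happens on a single empirical model, each iteration is an ordinary Bellman backup, which is where the computational savings claimed in the abstract come from.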
format | Article |
identifier | DOI: 10.48550/arxiv.1812.04363 |
language | eng |
source | arXiv.org |
subjects | Computer Science - Learning; Statistics - Machine Learning |
title | Exploration Bonus for Regret Minimization in Undiscounted Discrete and Continuous Markov Decision Processes |
url | https://arxiv.org/abs/1812.04363 |