Near-optimal Bayesian Solution For Unknown Discrete Markov Decision Process

Bibliographic Details
Main Authors: Tossou, Aristide; Dimitrakakis, Christos; Basu, Debabrota
Format: Article
Language: English
Date: 2019-06-20
DOI: 10.48550/arxiv.1906.09114
Source: arXiv.org
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computer Science and Game Theory; Computer Science - Learning; Statistics - Machine Learning

Abstract: We tackle the problem of acting in an unknown, finite, and discrete Markov Decision Process (MDP) for which the expected shortest path from any state to any other state is bounded by a finite number $D$. An MDP consists of $S$ states and $A$ possible actions per state. Upon choosing an action $a_t$ at state $s_t$, one receives a real-valued reward $r_t$ and transitions to a next state $s_{t+1}$. The reward $r_t$ is generated from a fixed reward distribution depending only on $(s_t, a_t)$; similarly, the next state $s_{t+1}$ is generated from a fixed transition distribution depending only on $(s_t, a_t)$. The objective is to maximize the accumulated reward after $T$ interactions. In this paper, we consider the case where the reward distributions, the transitions, $T$, and $D$ are all unknown. We derive the first polynomial-time Bayesian algorithm, BUCRL, that achieves, up to logarithmic factors, a regret (i.e., the difference between the accumulated rewards of the optimal policy and of our algorithm) of the optimal order $\tilde{\mathcal{O}}(\sqrt{DSAT})$. Importantly, our result holds with high probability for the worst-case (frequentist) regret, not the weaker notion of Bayesian regret. We perform experiments in a variety of environments that demonstrate the superiority of our algorithm over previous techniques. Our work also yields several results of independent interest. In particular, we derive a sharper upper bound for the KL-divergence of Bernoulli random variables, as well as sharper upper and lower bounds for Beta and Binomial quantiles. All the bounds are simple and use only elementary functions.
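
The interaction protocol and regret notion in the abstract can be made concrete. Writing $\rho^*$ for the optimal average reward per step, the (frequentist) regret after $T$ steps is $\mathrm{Regret}(T) = T\rho^* - \sum_{t=1}^{T} r_t$. The Python sketch below simulates this protocol for an arbitrary fixed policy in a toy tabular MDP; it is an illustration only, not the paper's BUCRL algorithm, and the environment tables, the policy argument, and rho_star are all hypothetical stand-ins.

import numpy as np

def run_policy(P, R, policy, T, rho_star, seed=0):
    # P: (S, A, S) transition probabilities; P[s, a] is a distribution over next states.
    # R: (S, A) mean rewards; observed rewards here are Bernoulli(R[s, a]).
    # policy: maps a state index to an action index (a stand-in for a learner).
    # rho_star: optimal average reward per step, assumed known for illustration.
    rng = np.random.default_rng(seed)
    S = P.shape[0]
    s, total = 0, 0.0
    for _ in range(T):
        a = policy(s)
        total += rng.binomial(1, R[s, a])  # reward depends only on (s_t, a_t)
        s = rng.choice(S, p=P[s, a])       # next state depends only on (s_t, a_t)
    return rho_star * T - total            # empirical regret against the optimal rate

# Toy 2-state, 2-action MDP with made-up numbers.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.2, 0.8]]])
R = np.array([[0.1, 0.6],
              [0.7, 0.3]])
print(run_policy(P, R, policy=lambda s: 1 - s, T=10_000, rho_star=0.65))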