Navigating to the Best Policy in Markov Decision Processes

We investigate the classical active pure exploration problem in Markov Decision Processes, where the agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible. We propose a problem-dependent lower bound on the average number...

Detailed description

Saved in:
Bibliographic details
Main authors: Marjani, Aymen Al, Garivier, Aurélien, Proutiere, Alexandre
Format: Article
Language: eng
Keywords:
Online access: order full text
creator Marjani, Aymen Al
Garivier, Aurélien
Proutiere, Alexandre
description We investigate the classical active pure exploration problem in Markov Decision Processes, where the agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible. We propose a problem-dependent lower bound on the average number of steps required before a correct answer can be given with probability at least $1-\delta$. We further provide the first algorithm with an instance-specific sample complexity in this setting. This algorithm addresses the general case of communicating MDPs; we also propose a variant with a reduced exploration rate (and hence faster convergence) under an additional ergodicity assumption. This work extends previous results relative to the \emph{generative setting}~\cite{pmlr-v139-marjani21a}, where the agent could at each step query the random outcome of any (state, action) pair. In contrast, we show here how to deal with the \emph{navigation constraints}, induced by the \emph{online setting}. Our analysis relies on an ergodic theorem for non-homogeneous Markov chains which we consider of wide interest in the analysis of Markov Decision Processes.
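The abstract distinguishes the generative setting, in which the agent may query the random outcome of any (state, action) pair, from the online setting, in which samples are constrained to follow a single system trajectory. The sketch below illustrates that distinction on a toy communicating MDP. It is not the authors' algorithm; the transition kernel P, reward table R, and the uniform exploration rule are assumptions made only for illustration.

    # Minimal sketch (illustrative, not the paper's method): contrast the
    # generative setting, where any (state, action) pair can be queried,
    # with the online setting, where samples must follow one trajectory.
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy communicating MDP with 3 states and 2 actions (assumed values).
    # P[s, a] is the next-state distribution, R[s, a] the mean reward.
    P = np.array([
        [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],
        [[0.2, 0.7, 0.1], [0.0, 0.2, 0.8]],
        [[0.5, 0.0, 0.5], [0.1, 0.1, 0.8]],
    ])
    R = np.array([[0.1, 0.3], [0.2, 0.5], [0.0, 0.9]])
    n_states, n_actions = R.shape

    def sample(s, a):
        """Draw one random transition and noisy reward from (s, a)."""
        s_next = rng.choice(n_states, p=P[s, a])
        r = rng.normal(R[s, a], 0.1)
        return s_next, r

    # Generative setting: the learner chooses how many samples each
    # (state, action) pair receives, here 100 each.
    counts_gen = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(100):
                sample(s, a)
                counts_gen[s, a] += 1

    # Online setting: the same total budget, but visit counts depend on
    # where the chosen actions lead (the navigation constraint).
    counts_online = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(600):
        a = rng.integers(n_actions)      # uniform exploration (assumed rule)
        s_next, _ = sample(s, a)
        counts_online[s, a] += 1
        s = s_next                       # must continue from the new state

    print("generative counts:\n", counts_gen)
    print("online counts (trajectory-constrained):\n", counts_online)

Under the navigation constraint the visit counts of (state, action) pairs cannot be chosen freely; they depend on the trajectory induced by the chosen actions, which is the difficulty the abstract refers to when moving from the generative to the online setting.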
doi_str_mv 10.48550/arxiv.2106.02847
format Article
creationdate 2021-06-05
rights http://creativecommons.org/licenses/by/4.0
oa free_for_read
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2106.02847
language eng
recordid cdi_arxiv_primary_2106_02847
source arXiv.org
subjects Computer Science - Learning
Statistics - Machine Learning
title Navigating to the Best Policy in Markov Decision Processes
url https://arxiv.org/abs/2106.02847