An Asymptotically Optimal Policy for Uniform Bandits of Unknown Support

Consider the problem of a controller sampling sequentially from a finite number of $N \geq 2$ populations, specified by random variables $X^i_k$, $i = 1, \ldots, N$, and $k = 1, 2, \ldots$, where $X^i_k$ denotes the outcome from population $i$ the $k^{th}$ time it is sampled. It is assumed that for each fixed $i$, $\{ X^i_k \}_{k \geq 1}$ is a sequence of i.i.d. uniform random variables over some interval $[a_i, b_i]$, with the support (i.e., $a_i, b_i$) unknown to the controller. The objective is to have a policy $\pi$ for deciding, based on available data, from which of the $N$ populations to sample at any time $n = 1, 2, \ldots$, so as to maximize the expected sum of outcomes of $n$ samples or, equivalently, to minimize the regret due to lack of information about the parameters $\{ a_i \}$ and $\{ b_i \}$. In this paper, we present a simple inflated sample mean (ISM) type policy that is asymptotically optimal in the sense of its regret achieving the asymptotic lower bound of Burnetas and Katehakis (1996). Additionally, finite horizon regret bounds are given.
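The inflated-sample-mean idea described in the abstract can be illustrated with a small simulation. The sketch below is a hypothetical stand-in, not the paper's algorithm: it plays each arm once and then always pulls the arm maximizing its running sample mean plus a generic UCB-style inflation bonus, whereas the paper derives a specific index from the uniform-support structure. The function `simulate_ism` and its parameters are illustrative names, not from the source.

```python
import math
import random

def simulate_ism(supports, horizon, seed=0):
    """Simulate an inflated-sample-mean style policy on uniform arms.

    supports: list of (a_i, b_i) pairs, unknown to the policy itself.
    Returns the pull count of each arm after `horizon` samples.
    NOTE: the inflation term below is a generic UCB-style bonus used
    only for illustration; it is NOT the index derived in the paper.
    """
    rng = random.Random(seed)
    n_arms = len(supports)
    counts = [0] * n_arms
    means = [0.0] * n_arms

    def pull(i):
        a, b = supports[i]
        x = rng.uniform(a, b)          # one outcome X^i_k ~ Uniform[a_i, b_i]
        counts[i] += 1
        means[i] += (x - means[i]) / counts[i]  # running sample mean

    for i in range(n_arms):            # initialization: one sample per arm
        pull(i)
    for t in range(n_arms, horizon):
        # inflated sample mean: mean plus an optimism bonus that
        # shrinks as an arm accumulates samples
        index = [means[i] + math.sqrt(2.0 * math.log(t + 1) / counts[i])
                 for i in range(n_arms)]
        pull(max(range(n_arms), key=lambda i: index[i]))
    return counts
```

With two arms of supports $[0, 1]$ and $[0, 2]$, the second arm has the larger mean, so any reasonable optimistic policy should pull it far more often over a long horizon, e.g. `simulate_ism([(0.0, 1.0), (0.0, 2.0)], 2000)`.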

Detailed Description

Bibliographic Details
Main Authors: Cowan, Wesley; Katehakis, Michael N.
Format: Article
Language: English
Published: 2015-05-07 (arXiv)
Online Access: Order full text
DOI: 10.48550/arXiv.1505.01918
Source: arXiv.org
Subjects: Computer Science - Learning; Statistics - Machine Learning