An Asymptotically Optimal Policy for Uniform Bandits of Unknown Support

Consider the problem of a controller sampling sequentially from a finite number of $N \geq 2$ populations, specified by random variables $X^i_k$, $i = 1, \ldots, N$, and $k = 1, 2, \ldots$, where $X^i_k$ denotes the outcome from population $i$ the $k^{th}$ time it is sampled. It is assumed that for each fixed $i$, $\{ X^i_k \}_{k \geq 1}$ is a sequence of i.i.d. uniform random variables over some interval $[a_i, b_i]$, with the support (i.e., $a_i, b_i$) unknown to the controller. The objective is to have a policy $\pi$ for deciding, based on available data, from which of the $N$ populations to sample at any time $n = 1, 2, \ldots$, so as to maximize the expected sum of outcomes of $n$ samples or, equivalently, to minimize the regret due to lack of information about the parameters $\{ a_i \}$ and $\{ b_i \}$. In this paper, we present a simple inflated sample mean (ISM) type policy that is asymptotically optimal in the sense of its regret achieving the asymptotic lower bound of Burnetas and Katehakis (1996). Additionally, finite horizon regret bounds are given.
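The inflated-sample-mean idea described in the abstract can be illustrated with a small simulation. The sketch below is a hypothetical stand-in, not the paper's algorithm: it plays each arm once and then always pulls the arm maximizing its running sample mean plus a generic UCB-style inflation bonus, whereas the paper derives a specific index from the uniform-support structure. The function `simulate_ism` and its parameters are illustrative names, not from the source.

```python
import math
import random

def simulate_ism(supports, horizon, seed=0):
    """Simulate an inflated-sample-mean style policy on uniform arms.

    supports: list of (a_i, b_i) pairs, unknown to the policy itself.
    Returns the pull count of each arm after `horizon` samples.
    NOTE: the inflation term below is a generic UCB-style bonus used
    only for illustration; it is NOT the index derived in the paper.
    """
    rng = random.Random(seed)
    n_arms = len(supports)
    counts = [0] * n_arms
    means = [0.0] * n_arms

    def pull(i):
        a, b = supports[i]
        x = rng.uniform(a, b)          # one outcome X^i_k ~ Uniform[a_i, b_i]
        counts[i] += 1
        means[i] += (x - means[i]) / counts[i]  # running sample mean

    for i in range(n_arms):            # initialization: one sample per arm
        pull(i)
    for t in range(n_arms, horizon):
        # inflated sample mean: mean plus an optimism bonus that
        # shrinks as an arm accumulates samples
        index = [means[i] + math.sqrt(2.0 * math.log(t + 1) / counts[i])
                 for i in range(n_arms)]
        pull(max(range(n_arms), key=lambda i: index[i]))
    return counts
```

With two arms of supports $[0, 1]$ and $[0, 2]$, the second arm has the larger mean, so any reasonable optimistic policy should pull it far more often over a long horizon, e.g. `simulate_ism([(0.0, 1.0), (0.0, 2.0)], 2000)`.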

Detailed Description

Bibliographic Details
Main Authors: Cowan, Wesley; Katehakis, Michael N.
Format: Article
Language: English
Published: 2015-05-07 (arXiv)
Online Access: Order full text
DOI: 10.48550/arXiv.1505.01918
Source: arXiv.org
Subjects: Computer Science - Learning; Statistics - Machine Learning