Explicit Best Arm Identification in Linear Bandits Using No-Regret Learners

We study the problem of best arm identification in linearly parameterised multi-armed bandits. Given a set of feature vectors $\mathcal{X}\subset\mathbb{R}^d,$ a confidence parameter $\delta$ and an unknown vector $\theta^*,$ the goal is to identify $\arg\max_{x\in\mathcal{X}}x^T\theta^*$, with probability at least $1-\delta,$ using noisy measurements of the form $x^T\theta^*.$ For this fixed confidence ($\delta$-PAC) setting, we propose an explicitly implementable and provably order-optimal sample-complexity algorithm to solve this problem. Previous approaches rely on access to minimax optimization oracles. The algorithm, which we call the \textit{Phased Elimination Linear Exploration Game} (PELEG), maintains a high-probability confidence ellipsoid containing $\theta^*$ in each round and uses it to eliminate suboptimal arms in phases. PELEG achieves fast shrinkage of this confidence ellipsoid along the most confusing (i.e., close to, but not optimal) directions by interpreting the problem as a two player zero-sum game, and sequentially converging to its saddle point using low-regret learners to compute players' strategies in each round. We analyze the sample complexity of PELEG and show that it matches, up to order, an instance-dependent lower bound on sample complexity in the linear bandit setting. We also provide numerical results for the proposed algorithm consistent with its theoretical guarantees.


Bibliographic details
Main authors: Zaki, Mohammadi; Mohan, Avi; Gopalan, Aditya
Format: Article
Language: English
Subjects: Computer Science - Learning; Statistics - Machine Learning
Online access: Order full text
creator Zaki, Mohammadi; Mohan, Avi; Gopalan, Aditya
description We study the problem of best arm identification in linearly parameterised multi-armed bandits. Given a set of feature vectors $\mathcal{X}\subset\mathbb{R}^d,$ a confidence parameter $\delta$ and an unknown vector $\theta^*,$ the goal is to identify $\arg\max_{x\in\mathcal{X}}x^T\theta^*$, with probability at least $1-\delta,$ using noisy measurements of the form $x^T\theta^*.$ For this fixed confidence ($\delta$-PAC) setting, we propose an explicitly implementable and provably order-optimal sample-complexity algorithm to solve this problem. Previous approaches rely on access to minimax optimization oracles. The algorithm, which we call the \textit{Phased Elimination Linear Exploration Game} (PELEG), maintains a high-probability confidence ellipsoid containing $\theta^*$ in each round and uses it to eliminate suboptimal arms in phases. PELEG achieves fast shrinkage of this confidence ellipsoid along the most confusing (i.e., close to, but not optimal) directions by interpreting the problem as a two player zero-sum game, and sequentially converging to its saddle point using low-regret learners to compute players' strategies in each round. We analyze the sample complexity of PELEG and show that it matches, up to order, an instance-dependent lower bound on sample complexity in the linear bandit setting. We also provide numerical results for the proposed algorithm consistent with its theoretical guarantees.
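The saddle-point mechanism described in the abstract (two no-regret learners playing a zero-sum game until their averaged strategies approximate the saddle point) can be illustrated with a minimal sketch. This is not PELEG itself, only the generic no-regret primitive it builds on: a row player running the Hedge (multiplicative weights) algorithm against a column player who best-responds each round; all function and variable names below are illustrative.

```python
import numpy as np

def saddle_point_mw(A, T=2000, eta=None):
    """Approximate the saddle point of the zero-sum matrix game
    min_p max_q p^T A q. The row player runs Hedge (multiplicative
    weights) over its rows; the column player best-responds each round.
    Returns the time-averaged row strategy and its worst-case value."""
    n, _ = A.shape
    if eta is None:
        eta = np.sqrt(np.log(n) / T)  # standard Hedge step size
    w = np.ones(n)                    # unnormalised row weights
    avg_p = np.zeros(n)
    for _ in range(T):
        p = w / w.sum()
        j = np.argmax(p @ A)          # column player's best response to p
        w *= np.exp(-eta * A[:, j])   # Hedge update on the incurred losses
        avg_p += p
    avg_p /= T
    # Worst-case payoff of the averaged strategy; by the standard
    # no-regret-to-equilibrium argument this approaches the game value.
    value = np.max(avg_p @ A)
    return avg_p, value

# Matching pennies: the game value is 0, attained by uniform strategies.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
p, v = saddle_point_mw(A)
```

Averaging the iterates is essential: the per-round strategies oscillate, but the averages converge to the saddle point at a rate governed by the learner's regret, which is the same style of argument the paper uses to drive down uncertainty along the most confusing directions.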
doi_str_mv 10.48550/arxiv.2006.07562
format Article
date 2020-06-13
rights http://arxiv.org/licenses/nonexclusive-distrib/1.0
oa free_for_read
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2006.07562
language eng
recordid cdi_arxiv_primary_2006_07562
source arXiv.org
subjects Computer Science - Learning
Statistics - Machine Learning
title Explicit Best Arm Identification in Linear Bandits Using No-Regret Learners