Explicit Best Arm Identification in Linear Bandits Using No-Regret Learners

We study the problem of best arm identification in linearly parameterised multi-armed bandits. Given a set of feature vectors $\mathcal{X}\subset\mathbb{R}^d,$ a confidence parameter $\delta$ and an unknown vector $\theta^*,$ the goal is to identify $\arg\max_{x\in\mathcal{X}}x^T\theta^*$, with probability at least $1-\delta,$ using noisy measurements of the form $x^T\theta^*.$ For this fixed confidence ($\delta$-PAC) setting, we propose an explicitly implementable and provably order-optimal sample-complexity algorithm to solve this problem. Previous approaches rely on access to minimax optimization oracles. The algorithm, which we call the \textit{Phased Elimination Linear Exploration Game} (PELEG), maintains a high-probability confidence ellipsoid containing $\theta^*$ in each round and uses it to eliminate suboptimal arms in phases. PELEG achieves fast shrinkage of this confidence ellipsoid along the most confusing (i.e., close to, but not optimal) directions by interpreting the problem as a two player zero-sum game, and sequentially converging to its saddle point using low-regret learners to compute players' strategies in each round. We analyze the sample complexity of PELEG and show that it matches, up to order, an instance-dependent lower bound on sample complexity in the linear bandit setting. We also provide numerical results for the proposed algorithm consistent with its theoretical guarantees.


Bibliographic details
Main authors: Zaki, Mohammadi; Mohan, Avi; Gopalan, Aditya
Format: Article
Language: English
Subjects: Computer Science - Learning; Statistics - Machine Learning
Online access: Order full text
creator Zaki, Mohammadi; Mohan, Avi; Gopalan, Aditya
description We study the problem of best arm identification in linearly parameterised multi-armed bandits. Given a set of feature vectors $\mathcal{X}\subset\mathbb{R}^d,$ a confidence parameter $\delta$ and an unknown vector $\theta^*,$ the goal is to identify $\arg\max_{x\in\mathcal{X}}x^T\theta^*$, with probability at least $1-\delta,$ using noisy measurements of the form $x^T\theta^*.$ For this fixed confidence ($\delta$-PAC) setting, we propose an explicitly implementable and provably order-optimal sample-complexity algorithm to solve this problem. Previous approaches rely on access to minimax optimization oracles. The algorithm, which we call the \textit{Phased Elimination Linear Exploration Game} (PELEG), maintains a high-probability confidence ellipsoid containing $\theta^*$ in each round and uses it to eliminate suboptimal arms in phases. PELEG achieves fast shrinkage of this confidence ellipsoid along the most confusing (i.e., close to, but not optimal) directions by interpreting the problem as a two player zero-sum game, and sequentially converging to its saddle point using low-regret learners to compute players' strategies in each round. We analyze the sample complexity of PELEG and show that it matches, up to order, an instance-dependent lower bound on sample complexity in the linear bandit setting. We also provide numerical results for the proposed algorithm consistent with its theoretical guarantees.
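The saddle-point mechanism described in the abstract (two no-regret learners playing a zero-sum game until their averaged strategies approximate the saddle point) can be illustrated with a minimal sketch. This is not PELEG itself, only the generic no-regret primitive it builds on: a row player running the Hedge (multiplicative weights) algorithm against a column player who best-responds each round; all function and variable names below are illustrative.

```python
import numpy as np

def saddle_point_mw(A, T=2000, eta=None):
    """Approximate the saddle point of the zero-sum matrix game
    min_p max_q p^T A q. The row player runs Hedge (multiplicative
    weights) over its rows; the column player best-responds each round.
    Returns the time-averaged row strategy and its worst-case value."""
    n, _ = A.shape
    if eta is None:
        eta = np.sqrt(np.log(n) / T)  # standard Hedge step size
    w = np.ones(n)                    # unnormalised row weights
    avg_p = np.zeros(n)
    for _ in range(T):
        p = w / w.sum()
        j = np.argmax(p @ A)          # column player's best response to p
        w *= np.exp(-eta * A[:, j])   # Hedge update on the incurred losses
        avg_p += p
    avg_p /= T
    # Worst-case payoff of the averaged strategy; by the standard
    # no-regret-to-equilibrium argument this approaches the game value.
    value = np.max(avg_p @ A)
    return avg_p, value

# Matching pennies: the game value is 0, attained by uniform strategies.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
p, v = saddle_point_mw(A)
```

Averaging the iterates is essential: the per-round strategies oscillate, but the averages converge to the saddle point at a rate governed by the learner's regret, which is the same style of argument the paper uses to drive down uncertainty along the most confusing directions.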
doi_str_mv 10.48550/arxiv.2006.07562
format Article
date 2020-06-13
rights http://arxiv.org/licenses/nonexclusive-distrib/1.0
oa free_for_read
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2006.07562
language eng
recordid cdi_arxiv_primary_2006_07562
source arXiv.org
subjects Computer Science - Learning
Statistics - Machine Learning
title Explicit Best Arm Identification in Linear Bandits Using No-Regret Learners