On the value of learning for Bernoulli bandits with unknown parameters
Investigates the multiarmed bandit problem, where each arm generates an infinite sequence of Bernoulli distributed rewards. The parameters of these Bernoulli distributions are unknown and initially assumed to be beta-distributed. Every time a bandit is selected, its beta distribution is updated with the new information in a Bayesian way.
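The Bayesian update described in the abstract is the standard Beta–Bernoulli conjugate update: observing a success or failure simply increments the corresponding Beta pseudo-count. A minimal illustrative sketch of that mechanism (not the authors' code; the class and names are assumptions for illustration):

```python
from dataclasses import dataclass

@dataclass
class BetaBernoulliArm:
    """Beta(a, b) belief over an arm's unknown Bernoulli success probability."""
    a: float = 1.0  # pseudo-count of successes (Beta(1, 1) = uniform prior)
    b: float = 1.0  # pseudo-count of failures

    def update(self, reward: int) -> None:
        # Conjugate Bayesian update: a success increments a, a failure increments b.
        if reward == 1:
            self.a += 1
        else:
            self.b += 1

    def mean(self) -> float:
        # Posterior mean = expected one-step reward under the current belief.
        return self.a / (self.a + self.b)

# The posterior variance a*b / ((a+b)**2 * (a+b+1)) shrinks to zero as the
# number of plays N grows, which is the intuition behind the paper's result
# that the gap between the "stop learning" and "full information" values vanishes.
arm = BetaBernoulliArm()
for r in [1, 0, 1, 1]:
    arm.update(r)
print(arm.mean())  # ≈ 0.667: Beta(4, 2) posterior mean after 3 successes, 1 failure
```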
Saved in:
Published in: | IEEE transactions on automatic control 2000-11, Vol.45 (11), p.2135-2140 |
---|---|
Main authors: | Bhulai, S. ; Koole, G. |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
container_end_page | 2140 |
---|---|
container_issue | 11 |
container_start_page | 2135 |
container_title | IEEE transactions on automatic control |
container_volume | 45 |
creator | Bhulai, S. ; Koole, G. |
description | Investigates the multiarmed bandit problem, where each arm generates an infinite sequence of Bernoulli distributed rewards. The parameters of these Bernoulli distributions are unknown and initially assumed to be beta-distributed. Every time a bandit is selected, its beta distribution is updated with the new information in a Bayesian way. The objective is to maximize the long-term discounted rewards. We study the relationship between the necessity of acquiring additional information and the reward. This is done by considering two extreme situations, which occur when a bandit has been played N times: the situation where the decision maker stops learning and the situation where the decision maker acquires full information about that bandit. We show that the difference in reward between this lower and upper bound goes to zero as N grows large. |
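The "stop learning" situation in the abstract can be made concrete for the discounted objective: if the decision maker commits to an arm and no longer updates beliefs, each play earns the posterior mean in expectation, and the geometric series of discount factors sums in closed form. A hedged sketch (the symbols gamma, a, b are illustrative, not the paper's notation, and this is only the committed-play value, not the paper's full lower bound):

```python
def committed_value(a: float, b: float, gamma: float) -> float:
    """Expected total discounted reward when the decision maker stops learning
    and plays one Beta(a, b)-believed arm forever: sum_{t>=0} gamma**t * mu."""
    assert 0.0 <= gamma < 1.0, "discount factor must lie in [0, 1)"
    mu = a / (a + b)            # posterior mean of the Bernoulli parameter
    return mu / (1.0 - gamma)   # closed form of the geometric series

print(committed_value(4, 2, 0.9))  # ≈ 6.667: (4/6) / (1 - 0.9)
```

This is the quantity that serves as a lower benchmark: actually continuing to learn can only do at least as well, while full information about the arm gives an upper benchmark.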
doi_str_mv | 10.1109/9.887641 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 0018-9286 |
ispartof | IEEE transactions on automatic control, 2000-11, Vol.45 (11), p.2135-2140 |
issn | 0018-9286 1558-2523 |
language | eng |
recordid | cdi_proquest_miscellaneous_28451601 |
source | IEEE Electronic Library (IEL) |
subjects | Adaptive control ; Arm ; Automatic control ; Bayesian analysis ; Bayesian methods ; Clinical trials ; Closed-form solution ; Decision making ; Diseases ; Dynamic programming ; Equations ; Learning ; Minimax techniques ; Plugs ; Upper bound ; Upper bounds |
title | On the value of learning for Bernoulli bandits with unknown parameters |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-11T23%3A04%3A02IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=On%20the%20value%20of%20learning%20for%20Bernoulli%20bandits%20with%20unknown%20parameters&rft.jtitle=IEEE%20transactions%20on%20automatic%20control&rft.au=Bhulai,%20S.&rft.date=2000-11-01&rft.volume=45&rft.issue=11&rft.spage=2135&rft.epage=2140&rft.pages=2135-2140&rft.issn=0018-9286&rft.eissn=1558-2523&rft.coden=IETAA9&rft_id=info:doi/10.1109/9.887641&rft_dat=%3Cproquest_RIE%3E28985090%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=884831547&rft_id=info:pmid/&rft_ieee_id=887641&rfr_iscdi=true |