Non-stationary Bandits with Habituation and Recovery Dynamics and Knapsack Constraints

Multi-armed bandit models have proven useful for modeling many real-world problems in control and sequential decision making with partial information. However, in many scenarios, such as those prevalent in healthcare and operations management, the decision maker's expected reward decreases if an action is selected too frequently, while it may recover if the decision maker abstains from selecting that action. This setting is further complicated when choosing a particular action also expends a random amount of a limited resource whose distribution is initially unknown to the decision maker. In this paper we study a class of models that addresses this setting, which we call reducing or gaining unknown efficacy bandits with stochastic knapsack constraints (ROGUEwK). We propose a combined upper confidence bound (UCB) and lower confidence bound (LCB) approximation algorithm for optimizing this model. Our algorithm chooses which action to play at each time point by solving a linear program (LP) with the UCBs of the average rewards and the LCBs of the average costs as inputs. We show that the regret of our algorithm is sub-linear as a function of time and total constraint budget when compared to a dynamic oracle. We validate the performance of our algorithm against existing state-of-the-art non-stationary and knapsack bandit approaches in a simulation study and show that our methods achieve, on average, a 13% improvement in total reward.
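The LP-based action selection described in the abstract can be made concrete with a short example. The snippet below is a minimal, hypothetical sketch rather than the authors' ROGUEwK implementation: the function name select_arm, the per-round budget heuristic remaining_budget / remaining_rounds, and the randomized rounding of the LP solution are all assumptions introduced for illustration.

# Hypothetical sketch of LP-based arm selection with UCB rewards and LCB costs.
# Names, the per-round budget heuristic, and the randomized rounding are
# illustrative assumptions, not the paper's exact ROGUEwK procedure.
import numpy as np
from scipy.optimize import linprog


def select_arm(ucb_rewards, lcb_costs, remaining_budget, remaining_rounds, rng=None):
    """Choose an arm by solving an LP over arm-play probabilities.

    maximize    sum_a p[a] * ucb_rewards[a]
    subject to  sum_a p[a] * lcb_costs[a] <= remaining_budget / remaining_rounds
                sum_a p[a] == 1,   0 <= p[a] <= 1
    """
    rng = rng if rng is not None else np.random.default_rng()
    k = len(ucb_rewards)
    per_round_budget = remaining_budget / max(remaining_rounds, 1)

    result = linprog(
        c=-np.asarray(ucb_rewards, dtype=float),          # linprog minimizes, so negate rewards
        A_ub=np.asarray(lcb_costs, dtype=float).reshape(1, -1),
        b_ub=[per_round_budget],                          # keep expected per-round cost within budget
        A_eq=np.ones((1, k)),
        b_eq=[1.0],                                       # probabilities sum to one
        bounds=[(0.0, 1.0)] * k,
    )
    if not result.success:                                # infeasible, e.g. budget exhausted
        return None

    probs = np.clip(result.x, 0.0, None)
    probs /= probs.sum()
    return int(rng.choice(k, p=probs))                    # sample an arm from the LP solution


# Example usage with made-up confidence bounds for three arms:
if __name__ == "__main__":
    arm = select_arm(ucb_rewards=[0.8, 0.6, 0.9],
                     lcb_costs=[0.5, 0.1, 0.7],
                     remaining_budget=10.0,
                     remaining_rounds=40)
    print("selected arm:", arm)

In this sketch, ucb_rewards and lcb_costs stand in for whatever confidence-bound estimates the decision maker maintains for the non-stationary reward and cost processes; the LP simply trades optimistic rewards off against pessimistic costs under the remaining budget.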

Bibliographic Details
Authors: He, Qinyang; Mintz, Yonatan
Format: Article
Language: English
Subjects: Mathematics - Optimization and Control
Source: arXiv.org
DOI: 10.48550/arxiv.2403.17073
Date: 2024-03-25
Online access: https://arxiv.org/abs/2403.17073