Non-stationary Bandits with Habituation and Recovery Dynamics and Knapsack Constraints
Multi-armed bandit models have proven useful for modeling many real-world problems in control and sequential decision making with partial information. However, in many scenarios, such as those prevalent in healthcare and operations management, the decision maker's expected reward decreases if an action is selected too frequently, while it may recover if they abstain from selecting that action. This setting is further complicated when choosing a particular action also expends a random amount of a limited resource whose distribution is initially unknown to the decision maker. In this paper we study a class of models that addresses this setting, which we call reducing or gaining unknown efficacy bandits with stochastic knapsack constraints (ROGUEwK). We propose a combined upper confidence bound (UCB) and lower confidence bound (LCB) approximation algorithm for optimizing this model. Our algorithm chooses which action to play at each time point by solving a linear program (LP) with the UCBs of the average rewards and the LCBs of the average costs as inputs. We show that the regret of our algorithm is sub-linear as a function of time and of the total constraint budget when compared to a dynamic oracle. We validate the performance of our algorithm against existing state-of-the-art non-stationary and knapsack bandit approaches in a simulation study, and show that our methods achieve, on average, a 13% improvement in total reward.
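The abstract reduces each round's decision to one small optimization: pick a distribution over arms that maximizes the optimistic (UCB) estimate of reward while the optimistic (LCB) estimate of cost stays within the remaining per-round budget. The snippet below is a minimal illustrative sketch of that LP selection step only, not the authors' implementation: the function name, the `budget_rate` parameter (assumed here to be remaining budget divided by remaining rounds), and the uniform fallback for an infeasible budget are my own assumptions, and the confidence bounds are taken as given rather than derived from the paper's habituation and recovery dynamics.

```python
import numpy as np
from scipy.optimize import linprog

def choose_action(reward_ucb, cost_lcb, budget_rate, rng=None):
    """Pick an arm by solving an LP over play probabilities:
    maximize the UCB reward subject to the LCB cost staying
    within the per-round budget rate (hypothetical sketch)."""
    rng = rng or np.random.default_rng()
    reward_ucb = np.asarray(reward_ucb, dtype=float)
    cost_lcb = np.asarray(cost_lcb, dtype=float)
    K = len(reward_ucb)
    res = linprog(
        c=-reward_ucb,                     # linprog minimizes, so negate to maximize reward
        A_ub=cost_lcb[None, :],            # optimistic expected cost <= budget rate
        b_ub=[budget_rate],
        A_eq=np.ones((1, K)), b_eq=[1.0],  # play probabilities sum to one
        bounds=[(0.0, 1.0)] * K,
        method="highs",
    )
    if not res.success:                    # infeasible budget: fall back to a uniform draw
        return int(rng.integers(K))
    p = np.clip(res.x, 0.0, None)
    p /= p.sum()                           # repair small numerical drift before sampling
    return int(rng.choice(K, p=p))

# Toy usage with made-up confidence bounds for three arms and a 0.5 budget rate.
print(choose_action(reward_ucb=[0.9, 0.6, 0.4], cost_lcb=[0.8, 0.3, 0.1], budget_rate=0.5))
```

Sampling an arm from the LP's probabilities, rather than always playing its single best arm, is one natural way to keep the realized spending of the played sequence close to the budget the LP was solved against.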
Main authors: | He, Qinyang; Mintz, Yonatan |
---|---|
Format: | Article |
Language: | English (eng) |
Subjects: | Mathematics - Optimization and Control |
Online access: | Order full text |
field | value
---|---
creator | He, Qinyang; Mintz, Yonatan |
description | Multi-armed bandit models have proven useful for modeling many real-world problems in control and sequential decision making with partial information. However, in many scenarios, such as those prevalent in healthcare and operations management, the decision maker's expected reward decreases if an action is selected too frequently, while it may recover if they abstain from selecting that action. This setting is further complicated when choosing a particular action also expends a random amount of a limited resource whose distribution is initially unknown to the decision maker. In this paper we study a class of models that addresses this setting, which we call reducing or gaining unknown efficacy bandits with stochastic knapsack constraints (ROGUEwK). We propose a combined upper confidence bound (UCB) and lower confidence bound (LCB) approximation algorithm for optimizing this model. Our algorithm chooses which action to play at each time point by solving a linear program (LP) with the UCBs of the average rewards and the LCBs of the average costs as inputs. We show that the regret of our algorithm is sub-linear as a function of time and of the total constraint budget when compared to a dynamic oracle. We validate the performance of our algorithm against existing state-of-the-art non-stationary and knapsack bandit approaches in a simulation study, and show that our methods achieve, on average, a 13% improvement in total reward. |
doi_str_mv | 10.48550/arxiv.2403.17073 |
format | Article |
fullrecord | (raw source XML omitted) arXiv:2403.17073; created 2024-03-25; rights: http://arxiv.org/licenses/nonexclusive-distrib/1.0; full text: https://arxiv.org/abs/2403.17073 |
identifier | DOI: 10.48550/arxiv.2403.17073 |
language | eng |
recordid | cdi_arxiv_primary_2403_17073 |
source | arXiv.org |
subjects | Mathematics - Optimization and Control |
title | Non-stationary Bandits with Habituation and Recovery Dynamics and Knapsack Constraints |
url | https://doi.org/10.48550/arXiv.2403.17073 |