Cost-Aware Cascading Bandits

Bibliographic details
Published in:	IEEE Transactions on Signal Processing, 2020, Vol. 68, pp. 3692-3706
Main authors:	Gan, Chao; Zhou, Ruida; Yang, Jing; Shen, Cong
Format:	Article
Language:	English
Subjects:	Algorithms; cascading bandits; Computational modeling; Confidence; Cost-aware; Decision making; Decision theory; Engineering; Engineering, Electrical & Electronic; Gallium nitride; Lower bounds; Medical services; Random variables; Science & Technology; Signal processing algorithms; Technology; Upper bound; Upper bounds; upper confidence bound

Description
Abstract:	In this paper, we propose a cost-aware cascading bandits model, a new variant of multi-armed bandits with cascading feedback that accounts for the random cost of pulling arms. In each step, the learning agent chooses an ordered list of items and examines them sequentially until a certain stopping condition is satisfied. The objective is to maximize the expected net reward in each step, i.e., the reward obtained in that step minus the total cost incurred in examining the items, by deciding both the ordered list of items and when to stop examination. We first consider the setting where the instantaneous cost of pulling an arm is unknown to the learner until the arm has been pulled. We study both the offline and online settings, depending on whether the state and cost statistics of the items are known beforehand. For the offline setting, we show that the Unit Cost Ranking with Threshold 1 (UCR-T1) policy is optimal. For the online setting, we propose a Cost-aware Cascading Upper Confidence Bound (CC-UCB) algorithm and show that its cumulative regret scales as $O(\log T)$. We also provide a lower bound for all $\alpha$-consistent policies, which scales as $\Omega(\log T)$ and matches our upper bound. We then investigate the setting where the instantaneous cost of pulling each arm is available to the learner for its decision-making, and show that a slight modification of the CC-UCB algorithm, termed CC-UCB2, is order-optimal. The performance of the algorithms is evaluated with both synthetic and real-world data.
ISSN:	1053-587X 1941-0476
DOI:	10.1109/TSP.2020.3001388
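
Illustration:	The abstract names the offline UCR-T1 policy but does not spell out its rule. The sketch below is one plausible reading of "Unit Cost Ranking with Threshold 1", not the authors' code. It assumes that each item i has a known success probability theta[i] and a known mean cost cost[i], that examination stops at the first item whose state is 1 (yielding reward 1), and that the policy ranks items by cost[i] / theta[i] in ascending order and keeps only those whose ratio is below 1. The function names and the example numbers are hypothetical.

```python
from typing import List, Sequence


def ucr_t1(theta: Sequence[float], cost: Sequence[float]) -> List[int]:
    """Order items by mean cost per unit of success probability.

    Assumed rule (not stated in the abstract): keep item i only if
    cost[i] / theta[i] < 1, then sort the kept items by that ratio.
    """
    ratios = {i: cost[i] / theta[i] for i in range(len(theta)) if theta[i] > 0}
    kept = [i for i, ratio in ratios.items() if ratio < 1.0]
    return sorted(kept, key=lambda i: ratios[i])


def expected_net_reward(
    order: Sequence[int], theta: Sequence[float], cost: Sequence[float]
) -> float:
    """Expected reward minus expected cost of examining `order` in sequence.

    Assumed model: item k is examined only if all earlier items had state 0;
    examining it costs cost[k] on average and succeeds with probability theta[k].
    """
    total = 0.0
    reach = 1.0  # probability that examination reaches the current position
    for i in order:
        total += reach * (theta[i] - cost[i])
        reach *= 1.0 - theta[i]
    return total


# Hypothetical instance with four items.
theta = [0.8, 0.5, 0.3, 0.2]
cost = [0.4, 0.1, 0.4, 0.05]

order = ucr_t1(theta, cost)
net = expected_net_reward(order, theta, cost)
print(order, round(net, 3))
```

For this instance the sketch returns the list [1, 3, 0]: item 2 is dropped because its mean cost (0.4) exceeds its success probability (0.3), so examining it would lower the expected net reward under the assumed model.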