A Risk-Averse Framework for Non-Stationary Stochastic Multi-Armed Bandits

Bibliographic details
Published in:	arXiv.org 2023-10
Main authors:	Alami, Reda; Mahfoud, Mohammed; Achab, Mastane
Format:	Article
Language:	English
Subjects:	Algorithms; Decision theory; Multi-armed bandit problems; Nonstationary environments; Optimization; Strategy
Online access:	Full text

Description
Summary:	In a typical stochastic multi-armed bandit problem, the objective is often to maximize the expected sum of rewards over some time horizon \(T\). While the choice of a strategy that accomplishes that is optimal with no additional information, it is no longer the case when provided additional environment-specific knowledge. In particular, in areas of high volatility like healthcare or finance, a naive reward maximization approach often does not accurately capture the complexity of the learning problem and results in unreliable solutions. To tackle problems of this nature, we propose a framework of adaptive risk-aware strategies that operate in non-stationary environments. Our framework incorporates various risk measures prevalent in the literature to map multiple families of multi-armed bandit algorithms into a risk-sensitive setting. In addition, we equip the resulting algorithms with the Restarted Bayesian Online Change-Point Detection (R-BOCPD) algorithm and impose a (tunable) forced exploration strategy to detect local (per-arm) switches. We provide finite-time theoretical guarantees and an asymptotic regret bound of order \(\tilde O(\sqrt{K_T T})\) up to time horizon \(T\) with \(K_T\) the total number of change-points. In practice, our framework compares favorably to the state-of-the-art in both synthetic and real-world environments and manages to perform efficiently with respect to both risk-sensitivity and non-stationarity.
ISSN:	2331-8422