p-Mean Regret for Stochastic Bandits
Format: Article
Language: English
Abstract: In this work, we extend the concept of the $p$-mean welfare objective from
social choice theory (Moulin 2004) to study $p$-mean regret in stochastic
multi-armed bandit problems. The $p$-mean regret, defined as the difference
between the optimal mean among the arms and the $p$-mean of the expected
rewards, offers a flexible framework for evaluating bandit algorithms, enabling
algorithm designers to balance fairness and efficiency by adjusting the
parameter $p$. Our framework encompasses both average cumulative regret and
Nash regret as special cases.
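
For concreteness, here is a minimal sketch of the quantities involved, assuming the standard $p$-mean (power mean) from social choice theory and writing $\mu^*$ for the optimal arm mean and $\mu_{I_t}$ for the expected reward of the arm pulled at round $t$; the paper's exact normalization may differ:

$$
M_p(x_1,\dots,x_T) = \Bigl(\tfrac{1}{T}\sum_{t=1}^{T} x_t^{\,p}\Bigr)^{1/p},
\qquad
R_p(T) = \mu^* - M_p\bigl(\mu_{I_1},\dots,\mu_{I_T}\bigr).
$$

Setting $p = 1$ gives the arithmetic mean and hence average cumulative regret, while letting $p \to 0$ gives the geometric mean and hence Nash regret.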
We introduce a simple, unified UCB-based algorithm (Explore-Then-UCB) that
achieves novel $p$-mean regret bounds. Our algorithm consists of two phases: a
carefully calibrated uniform exploration phase to initialize sample means,
followed by the UCB1 algorithm of Auer, Cesa-Bianchi, and Fischer (2002). Under
mild assumptions, we prove that our algorithm achieves a $p$-mean regret bound
of $\tilde{O}\left(\sqrt{\frac{k}{T^{\frac{1}{2|p|}}}}\right)$ for all $p \leq
-1$, where $k$ represents the number of arms and $T$ the time horizon. When
$-1$ …
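
As a rough illustration of the two-phase structure described in the abstract, here is a minimal sketch in Python; the exploration length `explore_rounds_per_arm`, the Bernoulli test arms, and the UCB1 confidence radius are placeholder choices, not the paper's calibrated parameters or analysis.

```python
import math
import random

def explore_then_ucb(arms, T, explore_rounds_per_arm):
    """Two-phase sketch: uniform exploration to initialize sample means,
    followed by the UCB1 index of Auer, Cesa-Bianchi, and Fischer (2002).

    `arms` is a list of callables, each returning a stochastic reward in [0, 1].
    `explore_rounds_per_arm` is a placeholder; the paper carefully calibrates
    the exploration phase rather than taking it as a free parameter.
    """
    k = len(arms)
    counts = [0] * k      # pulls per arm
    means = [0.0] * k     # empirical mean reward per arm

    def pull(i):
        r = arms[i]()
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]   # incremental mean update
        return r

    rewards = []
    # Phase 1: uniform exploration over all arms.
    for _ in range(explore_rounds_per_arm):
        for i in range(k):
            rewards.append(pull(i))
    # Phase 2: play the arm with the highest UCB1 index.
    for t in range(len(rewards) + 1, T + 1):
        ucb = [means[i] + math.sqrt(2 * math.log(t) / counts[i]) for i in range(k)]
        rewards.append(pull(max(range(k), key=lambda i: ucb[i])))
    return rewards

# Illustrative usage with three Bernoulli arms.
arms = [lambda p=p: 1.0 if random.random() < p else 0.0 for p in (0.3, 0.5, 0.7)]
print(sum(explore_then_ucb(arms, T=10_000, explore_rounds_per_arm=5)) / 10_000)
```

The uniform exploration phase ensures every empirical mean is initialized from several samples before index-based play begins, so the subsequent UCB indices are well defined from the first UCB round.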
DOI: 10.48550/arxiv.2412.10751