Rotting Infinitely Many-armed Bandits
Format: Article
Language: English
Abstract: We consider the infinitely many-armed bandit problem with rotting rewards, where the mean reward of an arm decreases at each pull of the arm according to an arbitrary trend with maximum rotting rate $\varrho=o(1)$. We show that this learning problem has an $\Omega(\max\{\varrho^{1/3}T,\sqrt{T}\})$ worst-case regret lower bound, where $T$ is the time horizon. We show that a matching upper bound $\tilde{O}(\max\{\varrho^{1/3}T,\sqrt{T}\})$, up to a poly-logarithmic factor, can be achieved when the algorithm knows the value of the maximum rotting rate $\varrho$, by an algorithm that uses a UCB index for each arm together with a threshold value to decide whether to continue pulling an arm or to remove it from further consideration. We also show that an $\tilde{O}(\max\{\varrho^{1/3}T,T^{3/4}\})$ regret upper bound can be achieved by an algorithm that does not know the value of $\varrho$, by using an adaptive UCB index along with an adaptive threshold value.
DOI: 10.48550/arxiv.2201.12975
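The known-$\varrho$ algorithm described in the abstract combines a per-arm UCB index with a threshold rule for discarding arms. Below is a minimal sketch of that idea, not the paper's exact procedure: the callables `sample_new_arm` and `pull`, the window length `h`, and the threshold scale `eps` are illustrative assumptions, as is the premise that initial mean rewards lie in $[0,1]$ with near-optimal arms close to 1.

```python
import math

def rotting_ucb_threshold(sample_new_arm, pull, T, rho, delta=None):
    """Sketch of a UCB-index-plus-threshold policy for rotting
    infinitely many-armed bandits, in the regime where the maximum
    rotting rate `rho` is known. The constants below are assumed
    scales for illustration, not the paper's tuned values."""
    if delta is None:
        delta = 1.0 / T
    # Assumed threshold scale, matching the two regimes in the regret bound.
    eps = max(rho ** (1.0 / 3.0), 1.0 / math.sqrt(T))
    # Assumed sliding-window length ~ eps^{-2}: old samples are discarded
    # because the arm's mean may have rotted since they were observed.
    h = max(1, int(1.0 / (eps * eps)))
    total = 0.0
    arm, rewards = sample_new_arm(), []
    for _ in range(T):
        r = pull(arm)
        total += r
        rewards.append(r)
        window = rewards[-h:]                # recent rewards only
        n = len(window)
        mean = sum(window) / n
        ucb = mean + math.sqrt(2.0 * math.log(1.0 / delta) / n)
        # Threshold rule: once enough samples are in, an arm whose UCB
        # index falls below near-optimal is removed from consideration
        # and never revisited; a fresh arm is drawn from the reservoir.
        if n >= h and ucb < 1.0 - eps:
            arm, rewards = sample_new_arm(), []
    return total

if __name__ == "__main__":
    import random
    state = {}

    def sample_new_arm():
        # Assumed reservoir: initial mean rewards uniform on [0, 1].
        arm = len(state)
        state[arm] = random.random()
        return arm

    def pull(arm, rho=1e-4):
        # Mean decays by at most rho at each pull (rotting rewards).
        state[arm] = max(0.0, state[arm] - rho)
        return state[arm] + random.gauss(0.0, 0.1)

    print(rotting_ucb_threshold(sample_new_arm, pull, T=10_000, rho=1e-4))
```

For the unknown-$\varrho$ setting, the abstract's adaptive variant would replace the fixed `eps` and `h` with an adaptive UCB index and an adaptive threshold, at the cost of the weaker $\tilde{O}(\max\{\varrho^{1/3}T,T^{3/4}\})$ guarantee.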