Nearly Minimax Optimal Regret for Multinomial Logistic Bandit
Saved in:
Main authors: | , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | In this paper, we study the contextual multinomial logit (MNL) bandit problem
in which a learning agent sequentially selects an assortment based on
contextual information, and user feedback follows an MNL choice model. There
has been a significant discrepancy between lower and upper regret bounds,
particularly regarding the maximum assortment size $K$. Additionally, the
variation in reward structures between these bounds complicates the quest for
optimality. Under uniform rewards, where all items have the same expected
reward, we establish a regret lower bound of $\Omega(d\sqrt{\smash[b]{T/K}})$
and propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper
bound of $\tilde{O}(d\sqrt{\smash[b]{T/K}})$. We also provide
instance-dependent minimax regret bounds under uniform rewards. Under
non-uniform rewards, we prove a lower bound of $\Omega(d\sqrt{T})$ and an upper
bound of $\tilde{O}(d\sqrt{T})$, also achievable by OFU-MNL+. Our empirical
studies support these theoretical findings. To the best of our knowledge, this
is the first work in the contextual MNL bandit literature to prove minimax
optimality -- for either uniform or non-uniform reward setting -- and to
propose a computationally efficient algorithm that achieves this optimality up
to logarithmic factors. |
---|---|
DOI: | 10.48550/arxiv.2405.09831 |
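For readers unfamiliar with the choice model referenced in the summary, the sketch below illustrates how MNL choice probabilities and expected assortment rewards are typically computed in this setting. It is a minimal illustration under standard assumptions (linear utilities $x_i^\top \theta$, outside option with utility normalized to 0), not the paper's OFU-MNL+ algorithm; the function names, dimensions, and random data are invented for this example.

```python
import numpy as np

def mnl_choice_probs(theta, X, assortment):
    """Multinomial logit (MNL) choice probabilities.

    theta      : (d,) preference parameter (unknown to the learner)
    X          : (N, d) item feature matrix
    assortment : list of item indices offered to the user (size <= K)

    Returns the choice probability of each offered item and the
    probability of the outside option (choosing nothing), whose
    utility is normalized to 0 as is standard for MNL models.
    """
    utilities = X[assortment] @ theta            # x_i^T theta for i in S
    shift = utilities.max()                      # shift for numerical stability
    exp_u = np.exp(utilities - shift)
    denom = np.exp(-shift) + exp_u.sum()         # exp(-shift) is the shifted outside option
    item_probs = exp_u / denom                   # exp(x_i^T theta) / (1 + sum_j exp(x_j^T theta))
    outside_prob = np.exp(-shift) / denom
    return item_probs, outside_prob

def expected_reward(theta, X, assortment, rewards):
    """Expected reward of an assortment: sum_{i in S} r_i * p(i | S, theta)."""
    item_probs, _ = mnl_choice_probs(theta, X, assortment)
    return float(item_probs @ rewards[assortment])

# Tiny usage example with made-up dimensions (d=3, N=5, K=2).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
theta = rng.normal(size=3)
rewards = np.ones(5)  # uniform-reward setting from the summary: all r_i equal
print(expected_reward(theta, X, [1, 4], rewards))
```

The uniform-reward case (all `rewards` entries equal) is where the summary's $\tilde{O}(d\sqrt{\smash[b]{T/K}})$ bound applies; with non-uniform `rewards` the relevant bound is $\tilde{O}(d\sqrt{T})$.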