SlateFree: a Model-Free Decomposition for Reinforcement Learning with Slate Actions
Main author: | |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | We consider the problem of sequential recommendations, where at each step an
agent proposes a slate of $N$ distinct items to a user from a much larger
catalog of size $K \gg N$. The user has unknown preferences towards the
recommendations, and the agent takes sequential actions that optimise (in our
case minimise) some user-related cost, with the help of Reinforcement Learning.
The number of possible item combinations for a slate is $\binom{K}{N}$, an
enormous number that renders value iteration methods intractable. We prove that
the slate-MDP can actually be decomposed using just $K$ item-related $Q$ functions
per state, which describe the problem in a more compact and efficient way.
Based on this, we propose a novel model-free SARSA and Q-learning algorithm
that performs $N$ parallel iterations per step, without any prior user
knowledge. We call this method \texttt{SlateFree}, i.e. free-of-slates, and we
show numerically that it converges very fast to the exact optimum for arbitrary
user profiles, and that it outperforms alternatives from the literature. |
---|---|
DOI: | 10.48550/arxiv.2209.01876 |
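
The summary contrasts the $\binom{K}{N}$ possible slates with only $K$ item-related $Q$ values per state, updated $N$ at a time. The following is a minimal sketch of that idea, not the paper's algorithm: the update target (cost plus the discounted mean of the $N$ lowest item values in the next state), the toy user simulator `make_toy_user`, and all function names and hyperparameters are assumptions made for illustration. The exact decomposition and update rules are given in the paper itself.

```python
import math
import random


def num_slates(K, N):
    """Number of unordered slates of N distinct items from a catalog of K."""
    return math.comb(K, N)


def greedy_slate(q_row, N):
    """Indices of the N items with the lowest item-wise Q value (cost is minimised)."""
    return sorted(range(len(q_row)), key=lambda j: q_row[j])[:N]


def make_toy_user(costs, follow_prob=0.8):
    """Hypothetical user: follows the slate with probability follow_prob,
    otherwise jumps to a uniformly random catalog item. The cost incurred
    is the (assumed, fixed) cost of the item the user lands on."""
    def step(state, slate, rng):
        if rng.random() < follow_prob:
            nxt = rng.choice(slate)
        else:
            nxt = rng.randrange(len(costs))
        return costs[nxt], nxt
    return step


def item_wise_q_learning(step, K, N, n_steps=20000, alpha=0.1, gamma=0.9,
                         epsilon=0.1, seed=0):
    """Tabular Q-learning storing K item-wise values per state (illustrative).

    The state is taken to be the item currently consumed, so the table is
    K x K instead of K x comb(K, N).
    """
    rng = random.Random(seed)
    Q = [[0.0] * K for _ in range(K)]
    state = rng.randrange(K)
    for _ in range(n_steps):
        # Epsilon-greedy choice of a slate of N distinct items.
        if rng.random() < epsilon:
            slate = rng.sample(range(K), N)
        else:
            slate = greedy_slate(Q[state], N)
        cost, nxt = step(state, slate, rng)
        # Assumed target: observed cost plus the discounted mean of the
        # N lowest item values in the next state.
        best_next = greedy_slate(Q[nxt], N)
        target = cost + gamma * sum(Q[nxt][j] for j in best_next) / N
        # N updates per step, one for each recommended item.
        for j in slate:
            Q[state][j] += alpha * (target - Q[state][j])
        state = nxt
    return Q


if __name__ == "__main__":
    K, N = 20, 3
    costs = [float(j) for j in range(K)]  # assumed per-item costs
    Q = item_wise_q_learning(make_toy_user(costs), K, N)
    print("slates per state:", num_slates(K, N), "vs item values per state:", K)
    print("greedy slate in state 0:", greedy_slate(Q[0], N))
```

The table holds `K` values per state regardless of `N`, whereas a table indexed by whole slates would need `num_slates(K, N)` entries per state. This is the storage saving the summary refers to.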