Towards Off-Policy Reinforcement Learning for Ranking Policies with Human Feedback
Saved in:

Main authors: ,
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract:
Probabilistic learning to rank (LTR) has been the dominant approach to optimizing ranking metrics, but it cannot maximize long-term rewards. Reinforcement learning models have been proposed to maximize users' long-term rewards by formulating recommendation as a sequential decision-making problem, but they achieve inferior accuracy compared to their LTR counterparts, primarily due to the lack of online interactions and the characteristics of ranking. In this paper, we propose a new off-policy value ranking (VR) algorithm that can simultaneously maximize users' long-term rewards and optimize the ranking metric offline for improved sample efficiency, within a unified Expectation-Maximization (EM) framework. We show theoretically and empirically that the EM process guides the learned policy to enjoy the benefits of integrating the future reward and the ranking metric, and to learn without any online interactions. Extensive offline and online experiments demonstrate the effectiveness of our methods.
DOI: 10.48550/arxiv.2401.08959
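The abstract only describes the method at a high level. As a rough illustration of the general idea (not the paper's actual algorithm or released code), the sketch below shows one way an EM-style loop could alternate between an E-step that forms a soft posterior over items by blending logged clicks with a discounted off-policy value estimate, and an M-step that fits a softmax ranking policy to that posterior with a listwise cross-entropy loss. All data, the linear scoring model, and the names (`e_step`, `m_step`, `q_values`, `gamma`, `temperature`) are hypothetical assumptions introduced here for illustration.

```python
# Hedged sketch of an EM-style off-policy value-ranking loop.
# NOT the authors' implementation; all quantities below are synthetic.
import numpy as np

rng = np.random.default_rng(0)

n_items, dim = 20, 8
gamma = 0.9          # discount factor weighting the long-term value term
temperature = 1.0    # softness of the E-step posterior

# Logged (offline) data: item features, immediate feedback (clicks), and a
# stand-in per-item off-policy value estimate (e.g. from a fitted Q model).
item_feats = rng.normal(size=(n_items, dim))
clicks = rng.binomial(1, 0.3, size=n_items).astype(float)
q_values = rng.normal(size=n_items)

theta = np.zeros(dim)  # parameters of a linear scoring policy


def policy_scores(theta):
    """Linear relevance scores s(a) = <theta, x_a>."""
    return item_feats @ theta


def e_step(theta):
    """E-step: soft posterior over items mixing immediate feedback with the
    discounted future-value estimate and the current policy scores."""
    logits = (clicks + gamma * q_values + policy_scores(theta)) / temperature
    logits -= logits.max()            # numerical stability
    post = np.exp(logits)
    return post / post.sum()


def m_step(theta, post, lr=0.1, steps=50):
    """M-step: fit the softmax ranking policy to the E-step posterior by
    gradient descent on a listwise cross-entropy loss."""
    for _ in range(steps):
        scores = policy_scores(theta)
        scores -= scores.max()
        pi = np.exp(scores)
        pi /= pi.sum()
        grad = item_feats.T @ (pi - post)   # gradient of CE(post, pi)
        theta = theta - lr * grad
    return theta


for _ in range(10):
    post = e_step(theta)
    theta = m_step(theta, post)

ranking = np.argsort(-policy_scores(theta))
print("top-5 items:", ranking[:5])
```

In this toy setup, the E-step is where the long-term signal enters (via `q_values`), while the M-step is a purely supervised ranking fit, which is the sense in which such an EM loop can be run entirely offline on logged data.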