Plug-and-Play Model-Agnostic Counterfactual Policy Synthesis for Deep Reinforcement Learning-Based Recommendation

Detailed description

Bibliographic details
Published in:	IEEE Transactions on Neural Networks and Learning Systems 2023-11, Vol.PP, p.1-12
Main authors:	Wang, Siyu; Chen, Xiaocong; McAuley, Julian; Cripps, Sally; Yao, Lina
Format:	Article
Language:	eng
Subjects:	Australia; Causality; Computer science; counterfactual; Data models; deep reinforcement learning (DRL); Learning systems; Mathematical models; policy synthesis; Recommender systems; Training

Description
Summary:	Recent advances in recommender systems have proved the potential of reinforcement learning (RL) to handle the dynamic evolution processes between users and recommender systems. However, learning to train an optimal RL agent is generally impractical with commonly sparse user feedback data in the context of recommender systems. To circumvent the lack of interaction of current RL-based recommender systems, we propose to learn a general model-agnostic counterfactual synthesis (MACS) policy for counterfactual user interaction data augmentation. The counterfactual synthesis policy aims to synthesize counterfactual states while preserving significant information in the original state relevant to the user's interests, building upon two different training approaches we designed: learning with expert demonstrations and joint training. As a result, the synthesis of each counterfactual data is based on the current recommendation agent's interaction with the environment to adapt to users' dynamic interests. We integrate the proposed policy with deep deterministic policy gradient (DDPG), soft actor-critic (SAC), and twin delayed DDPG (TD3) in an adaptive pipeline with a recommendation agent that can generate counterfactual data to improve the performance of recommendation. The empirical results on both online simulation and offline datasets demonstrate the effectiveness and generalization of our counterfactual synthesis policy and verify that it improves the performance of RL recommendation agents.
ISSN:	2162-237X 2162-2388
DOI:	10.1109/TNNLS.2023.3329808