Unified Curiosity-Driven Learning with Smoothed Intrinsic Reward Estimation
Published in: Pattern Recognition, 2022-03, Vol. 123, p. 108352, Article 108352
Main authors:
Format: Article
Language: English
Keywords:
Online access: Full text
Abstract:
• We propose a novel distribution-aware and policy-aware unified curiosity-driven learning framework that unifies state novelty and state-action novelty. DAW enables the agent to explore states diversely, and PAW encourages the agent to explore states where the policy is uncertain about which action to take. The proposed approach improves the exploration ability of RL with a complete intrinsic reward.
• We propose to improve the robustness of policy learning by smoothing the intrinsic reward with a batch of transitions close to the current transition, and we propose to employ an attention module to extract task-relevant features for a more precise estimation of the intrinsic reward.
• Extensive experiments on Atari games demonstrate the effectiveness of our approach.

In reinforcement learning (RL), intrinsic reward estimation is necessary for policy learning when the extrinsic reward is sparse or absent. To this end, Unified Curiosity-driven Learning with Smoothed intrinsic reward Estimation (UCLSE) is proposed to address the sparse extrinsic reward problem from the perspective of completeness of intrinsic reward estimation. We further propose a state distribution-aware weighting method and a policy-aware weighting method to dynamically unify two mainstream intrinsic reward estimation methods. In this way, the agent can explore the environment more effectively and efficiently. Under this framework, we propose to employ an attention module to extract task-relevant features for a more precise estimation of the intrinsic reward. Moreover, we propose to improve the robustness of policy learning by smoothing the intrinsic reward with a batch of transitions close to the current transition. Extensive experimental results on Atari games demonstrate that our method outperforms the state-of-the-art approaches in terms of both score and training efficiency.
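The unified weighting and the transition-level smoothing described in the abstract can be pictured roughly as in the sketch below. This is a minimal illustrative sketch, not the authors' implementation: it assumes a simple count-based stand-in for the state-novelty bonus, policy entropy as a stand-in for the state-action (policy-aware) bonus, and Euclidean nearest neighbours as the notion of "transitions close to the current transition"; all function and variable names are hypothetical.

```python
# Illustrative sketch only: dynamically weighted unified intrinsic reward
# plus smoothing over nearby transitions. Names and formulas are assumptions,
# not the UCLSE paper's exact method.
import numpy as np


def state_novelty(state, visit_counts):
    """Count-based stand-in for a state-novelty bonus."""
    return 1.0 / np.sqrt(visit_counts.get(state, 0) + 1)


def policy_uncertainty(action_probs):
    """Policy entropy at this state: high when the policy is unsure which
    action to take, which is what a policy-aware bonus rewards."""
    p = np.clip(np.asarray(action_probs), 1e-8, 1.0)
    return float(-(p * np.log(p)).sum())


def unified_intrinsic_reward(state, action_probs, visit_counts, w_dist):
    """Weighted mix of the two bonuses; w_dist in [0, 1] plays the role of a
    distribution-aware weight, (1 - w_dist) the policy-aware weight."""
    r_state = state_novelty(state, visit_counts)
    r_policy = policy_uncertainty(action_probs)
    return w_dist * r_state + (1.0 - w_dist) * r_policy


def smoothed_intrinsic_reward(rewards, embeddings, query_idx, k=8):
    """Average the intrinsic reward over the k transitions whose feature
    embeddings are closest to the current one, damping noisy per-step values."""
    dists = np.linalg.norm(embeddings - embeddings[query_idx], axis=1)
    nearest = np.argsort(dists)[:k]
    return float(np.mean(rewards[nearest]))
```

In this picture, the weight w_dist would be set adaptively from statistics of the visited state distribution rather than fixed by hand, and the embeddings used for the nearest-neighbour smoothing would come from a learned, task-relevant feature extractor (the role the abstract assigns to the attention module).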
ISSN: 0031-3203, 1873-5142
DOI: 10.1016/j.patcog.2021.108352