Elastic step DQN: A novel multi-step algorithm to alleviate overestimation in Deep Q-Networks

Bibliographic details
Published in: Neurocomputing (Amsterdam), 2024-04, Vol. 576, p. 127170, Article 127170
Authors: Ly, Adrian; Dazeley, Richard; Vamplew, Peter; Cruz, Francisco; Aryal, Sunil
Format: Article
Language: English
Abstract: The Deep Q-Networks (DQN) algorithm was the first reinforcement learning algorithm to use deep neural networks to surpass human-level performance in a number of Atari learning environments. However, divergent and unstable behaviour have been long-standing issues in DQNs. The unstable behaviour is often characterised by overestimation of the Q-values, commonly referred to as the overestimation bias. To address the overestimation bias and the divergent behaviour, a number of heuristic extensions have been proposed. Notably, multi-step updates have been shown to drastically reduce unstable behaviour while improving agents' training performance. However, agents are often highly sensitive to the choice of the multi-step update horizon (n), and our empirical experiments show that a poorly chosen static value of n can in many cases lead to worse performance than single-step DQN. Inspired by the success of n-step DQN and the effect that multi-step updates have on the overestimation bias, this paper proposes a new algorithm, 'Elastic Step DQN' (ES-DQN), to alleviate overestimation bias in DQNs. ES-DQN dynamically varies the step-size horizon in multi-step updates based on the similarity between visited states. Our empirical evaluation shows that ES-DQN outperforms n-step DQN with a fixed n, Double DQN and Average DQN in several OpenAI Gym environments while also alleviating the overestimation bias.
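
The abstract describes ES-DQN only at a high level; the similarity measure and the exact rule for choosing the horizon are not given here. As a minimal sketch of the multi-step machinery involved, assuming a Python implementation, the snippet below computes a standard n-step return target and uses a hypothetical rule that extends the horizon while consecutive states remain similar. The similarity function, the threshold and the n_max cap are illustrative placeholders, not the authors' formulation.

def n_step_return(rewards, bootstrap_q, gamma=0.99):
    # Standard multi-step target: discounted sum of the first n rewards
    # plus a bootstrapped Q-value for the state reached after n steps.
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g + (gamma ** len(rewards)) * bootstrap_q

def elastic_horizon(states, similarity, threshold=0.9, n_max=8):
    # Hypothetical horizon rule: keep extending n while consecutive states
    # stay similar, capped at n_max; ES-DQN's actual criterion may differ.
    n = 1
    while n < min(n_max, len(states) - 1) and similarity(states[n - 1], states[n]) >= threshold:
        n += 1
    return n

With a rule of this kind, the target for a transition starting at states[0] would use the first n rewards of the stored trajectory and bootstrap from the maximum target-network Q-value at states[n], so the effective update horizon adapts per transition instead of being fixed in advance.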
ISSN: 0925-2312, 1872-8286
DOI: 10.1016/j.neucom.2023.127170