The Projected Bellman Equation in Reinforcement Learning

Bibliographic details
Published in:	IEEE Transactions on Automatic Control 2024-12, Vol.69 (12), p.8323-8337
Main author:	Meyn, Sean
Format:	Article
Language:	English
Subjects:	Algorithms; Approximation algorithms; Bellman theory; Convergence; Function approximation; Machine learning; Markov processes; Mathematical models; Optimal control; Optimization; Parameter estimation; Q-learning; Reinforcement learning; Training; Vectors

Description
Summary:	Q-learning has become an important part of the reinforcement learning toolkit since its introduction in the dissertation of Chris Watkins in the 1980s. In the original tabular formulation, the goal is to compute exactly a solution to the discounted-cost optimality equation, and thereby obtain the optimal policy for a Markov Decision Process. The goal today is more modest: obtain an approximate solution within a prescribed function class. The standard algorithms are based on the same architecture as formulated in the 1980s, with the goal of finding a value function approximation that solves the so-called projected Bellman equation. While reinforcement learning has been an active research area for over four decades, there is little theory providing conditions for convergence of these Q-learning algorithms, or even existence of a solution to this equation. The purpose of this article is to show that a solution to the projected Bellman equation does exist, provided the function class is linear and the input used for training is a form of $\varepsilon$-greedy policy with sufficiently small $\varepsilon$. Moreover, under these conditions it is shown that the Q-learning algorithm is stable, in the sense that the parameter estimates remain bounded. Convergence remains one of many open topics for research.
ISSN:	0018-9286 1558-2523
DOI:	10.1109/TAC.2024.3409647
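
Illustration (not part of the catalog record): the setting named in the summary, Q-learning over a linear function class with training input from an $\varepsilon$-greedy policy, can be sketched roughly as below. This is a generic textbook-style sketch and is not taken from the article. The helper names (random_mdp, q_learning_linear), the step-size rule, the random test model, and the use of a plain $\varepsilon$-greedy rule are all assumptions made for illustration; the article's precise policy and algorithm may differ. Costs are minimized, in keeping with the discounted-cost formulation mentioned in the summary.

```python
import numpy as np


def random_mdp(n_x=4, n_u=2, seed=0):
    """Hypothetical helper: random transition law P[x, u, x'] and costs in [0, 1)."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_x, n_u, n_x))
    P /= P.sum(axis=2, keepdims=True)  # each P[x, u, :] is a probability vector
    cost = rng.random((n_x, n_u))
    return P, cost


def q_learning_linear(P, cost, psi, gamma=0.9, eps=0.05, n_steps=5000, seed=0):
    """Q-learning with the linear class Q^theta(x, u) = theta . psi[x, u].

    P:    (n_x, n_u, n_x) transition probabilities
    cost: (n_x, n_u) one-step costs
    psi:  (n_x, n_u, d) feature vectors (the basis of the function class)
    """
    rng = np.random.default_rng(seed)
    n_x, n_u, d = psi.shape
    theta = np.zeros(d)
    x = 0
    for n in range(1, n_steps + 1):
        q_x = psi[x] @ theta  # Q^theta(x, u) for every action u
        # Plain epsilon-greedy input (assumed): explore with probability eps,
        # otherwise take the action minimizing the current Q estimate.
        if rng.random() < eps:
            u = int(rng.integers(n_u))
        else:
            u = int(np.argmin(q_x))
        x_next = int(rng.choice(n_x, p=P[x, u]))
        # Temporal-difference term for the discounted-cost criterion.
        td = cost[x, u] + gamma * np.min(psi[x_next] @ theta) - q_x[u]
        alpha = 1.0 / (1.0 + n / 100.0)  # assumed step-size rule, always below 1
        theta = theta + alpha * td * psi[x, u]
        x = x_next
    return theta


if __name__ == "__main__":
    P, cost = random_mdp()
    # One-hot features recover the tabular case as a special linear class.
    psi = np.eye(8).reshape(4, 2, 8)
    print(q_learning_linear(P, cost, psi))
```

With one-hot features the sketch reduces to tabular Q-learning, where the estimates stay between 0 and 1/(1 - gamma) for costs in [0, 1) and step sizes at most 1. For a general feature matrix psi no such bound is automatic, which is the kind of stability question the summary refers to.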