Gaussian processes for informative exploration in reinforcement learning
Main authors: | , , |
---|---|
Format: | Conference paper |
Language: | English |
Subject headings: | |
Online access: | Order full text |
Abstract: | This paper presents the iGP-SARSA(λ) algorithm for temporal difference reinforcement learning (RL) with non-myopic information gain considerations. The proposed algorithm uses a Gaussian process (GP) model to approximate the state-action value function, Q, and incorporates the variance measure from the GP into the calculation of the discounted information gain value for all future state-actions rolled out from the current state-action. The algorithm was compared against a standard SARSA(λ) algorithm on two simulated examples: a battery charge/discharge problem and a soaring glider problem. Results show that incorporating the information gain value into the action selection encouraged exploration early on, allowing the iGP-SARSA(λ) algorithm to converge to a more profitable reward cycle, while the ε-greedy exploration strategy in the SARSA(λ) algorithm failed to search beyond the local optimal solution. |
ISSN: | 1050-4729, 2577-087X |
DOI: | 10.1109/ICRA.2013.6630938 |
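
The abstract describes the core idea of iGP-SARSA(λ): approximate Q with a GP and let the GP's predictive variance drive an information gain term in action selection. The Python sketch below illustrates that idea only in a minimal form and is not the authors' implementation: the class name `GPQModel`, the RBF kernel hyperparameters, the exploration weight `beta`, and the toy usage values are illustrative assumptions, and the sketch scores only the immediate variance rather than the paper's discounted, non-myopic roll-out of future state-actions.

```python
# Minimal illustrative sketch (assumptions noted above), not the paper's code:
# a GP posterior over (state, action) pairs whose predictive standard
# deviation is added to the value estimate when choosing actions.
import numpy as np


class GPQModel:
    """GP regression over 2-D (state, action) inputs with an RBF kernel."""

    def __init__(self, lengthscale=0.5, signal_var=1.0, noise_var=1e-2):
        self.ls, self.sv, self.nv = lengthscale, signal_var, noise_var
        self.X = np.empty((0, 2))
        self.y = np.empty(0)

    def _kernel(self, A, B):
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return self.sv * np.exp(-0.5 * sq_dists / self.ls ** 2)

    def add(self, state, action, target):
        """Store one (state, action) -> TD target observation."""
        self.X = np.vstack([self.X, [state, action]])
        self.y = np.append(self.y, target)

    def predict(self, state, action):
        """Posterior mean and standard deviation of Q(state, action)."""
        if self.y.size == 0:
            return 0.0, np.sqrt(self.sv)  # prior before any data
        K = self._kernel(self.X, self.X) + self.nv * np.eye(self.y.size)
        k_star = self._kernel(np.array([[state, action]]), self.X)[0]
        alpha = np.linalg.solve(K, self.y)
        mean = k_star @ alpha
        var = self.sv - k_star @ np.linalg.solve(K, k_star)
        return mean, np.sqrt(max(var, 1e-12))


def select_action(model, state, actions, beta=1.0):
    """Greedy in (mean + beta * std): the variance bonus rewards visiting
    state-actions the GP is still uncertain about."""
    scores = []
    for a in actions:
        mu, sigma = model.predict(state, a)
        scores.append(mu + beta * sigma)
    return actions[int(np.argmax(scores))]


# Hypothetical usage for a single transition (s, a, r, s'):
model = GPQModel()
actions = [0.0, 1.0]
s, a, r, s_next, gamma = 0.2, 1.0, 0.5, 0.3, 0.95
a_next = select_action(model, s_next, actions)
td_target = r + gamma * model.predict(s_next, a_next)[0]  # one-step SARSA target
model.add(s, a, td_target)
```

Compared with ε-greedy, which explores by occasionally picking an action uniformly at random, this score explicitly steers exploration toward poorly modelled state-actions, which is the behaviour the abstract credits for escaping the local optimum; the full iGP-SARSA(λ) algorithm additionally discounts the information gain over rolled-out future state-actions rather than using only the immediate variance.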