TRAINING REINFORCEMENT LEARNING AGENTS TO PERFORM MULTIPLE TASKS ACROSS DIVERSE DOMAINS

Bibliographic details
Main authors:	PASUKONIS, Jurgis; LILLICRAP, Timothy Paul; HAFNER, Danijar
Format:	Patent
Language:	English; French
Subjects:	CALCULATING; COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS; COMPUTING; COUNTING; PHYSICS

Description
Summary:	Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a policy neural network used to select an action to be performed by an agent interacting with an environment. In one aspect, a method includes: receiving a latent representation that characterizes a current state of the environment; generating an imagination trajectory of latent representations; for each latent representation in the imagination trajectory, determining a predicted reward and generating a predicted state value; determining a target state value for each latent representation; determining an update to the current values of the policy network parameters; applying a symmetric logarithmic transformation to each target state value; encoding each transformed target state value to generate an encoded transformed target state value; and determining an update to the current values of the value network parameters by optimizing a critic objective function.
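
The summary names a symmetric logarithmic transformation and an encoding of each transformed target state value, but it gives neither a formula nor an encoding scheme. The sketch below is illustrative only. It assumes the commonly used definition symlog(x) = sign(x) * ln(1 + |x|), with inverse symexp, and it assumes a "two-hot" encoding over a fixed grid of bins as one possible encoding. The names `symlog`, `symexp`, `two_hot`, and the bin range are hypothetical and do not come from the patent record.

```python
import math
from bisect import bisect_right


def symlog(x: float) -> float:
    """Assumed definition: sign(x) * ln(1 + |x|). Compresses large magnitudes."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(y: float) -> float:
    """Inverse of symlog under the assumed definition: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)


def two_hot(value: float, bins: list[float]) -> list[float]:
    """Hypothetical encoding: spread `value` over its two neighbouring bins.

    `bins` must be sorted ascending. Values outside the grid are clipped.
    The weights sum to 1 and their weighted average of the bins equals the
    (clipped) value.
    """
    value = min(max(value, bins[0]), bins[-1])
    hi = bisect_right(bins, value)  # first bin strictly greater than value
    lo = hi - 1
    encoded = [0.0] * len(bins)
    if hi == len(bins):
        # value sits exactly on the last bin
        encoded[lo] = 1.0
        return encoded
    weight_hi = (value - bins[lo]) / (bins[hi] - bins[lo])
    encoded[lo] = 1.0 - weight_hi
    encoded[hi] = weight_hi
    return encoded


# Usage: transform a target state value, then encode it.
bins = [float(b) for b in range(-20, 21)]  # assumed grid in symlog space
target_state_value = 250.0
encoded = two_hot(symlog(target_state_value), bins)

# Decoding reverses both steps: expected bin value, then symexp.
decoded = symexp(sum(w * b for w, b in zip(encoded, bins)))
```

Under these assumptions, a value network trained against such encoded targets would predict a distribution over bins rather than a raw scalar, which keeps the regression targets in a bounded range regardless of reward scale.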