IMITATION LEARNING BASED ON PREDICTION OF OUTCOMES
A method is proposed of training a policy model to generate action data for controlling an agent to perform a task in an environment. The method comprises: obtaining, for each of a plurality of performances of the task, a corresponding demonstrator trajectory comprising a plurality of sets of state...
Gespeichert in:
Hauptverfasser: | , , , |
---|---|
Format: | Patent |
Sprache: | eng ; fre ; ger |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | A method is proposed of training a policy model to generate action data for controlling an agent to perform a task in an environment. The method comprises: obtaining, for each of a plurality of performances of the task, a corresponding demonstrator trajectory comprising a plurality of sets of state data characterizing the environment at each of a plurality of corresponding successive time steps during the performance of the task; using the demonstrator trajectories to generate a demonstrator model, the demonstrator model being operative to generate, for any said demonstrator trajectory, a value indicative of the probability of the demonstrator trajectory occurring; and jointly training an imitator model and a policy model. The joint training is performed by: generating a plurality of imitation trajectories, each imitation trajectory being generated by repeatedly receiving state data indicating a state of the environment, using the policy model to generate action data indicative of an action, and causing the action to be performed by the agent; training the imitator model using the imitation trajectories, the imitator model being operative to generate, for any said imitation trajectory, a value indicative of the probability of the imitation trajectory occurring; and training the policy model using a reward function which is a measure of the similarity of the demonstrator model and the imitator model. |
---|