JOINTLY UPDATING AGENT CONTROL POLICIES USING ESTIMATED BEST RESPONSES TO CURRENT CONTROL POLICIES
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating control policies for controlling agents in an environment. One of the methods includes, at each of a plurality of iterations: obtaining a current joint control policy for a plurality of age...
Gespeichert in:
Hauptverfasser: | , , , |
---|---|
Format: | Patent |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating control policies for controlling agents in an environment. One of the methods includes, at each of a plurality of iterations: obtaining a current joint control policy for a plurality of agents, the current joint control policy specifying a respective current control policy for each agent; and updating the current joint control policy, comprising, for each agent: generating a respective reward estimate for each of a plurality of alternate control policies that is an estimate of a reward received by the agent if the agent is controlled using the alternate control policy while the other agents are controlled using the respective current control policies; computing a best response for the agent from the respective reward estimates; and updating the respective current control policy for the agent using the best response for the agent. |
---|