Chain of Thought Imitation with Procedure Cloning
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | Imitation learning aims to extract high-performance policies from logged
demonstrations of expert behavior. It is common to frame imitation learning as
a supervised learning problem in which one fits a function approximator to the
input-output mapping exhibited by the logged demonstrations (input observations
to output actions). While the framing of imitation learning as a supervised
input-output learning problem allows for applicability in a wide variety of
settings, it is also an overly simplistic view of the problem in situations
where the expert demonstrations provide much richer insight into expert
behavior. For example, applications such as path navigation, robot
manipulation, and strategy games acquire expert demonstrations via planning,
search, or some other multi-step algorithm, revealing not just the output
action to be imitated but also the procedure for how to determine this action.
While these intermediate computations may use tools not available to the agent
during inference (e.g., environment simulators), they are nevertheless
informative as a way to explain an expert's mapping of state to actions. To
properly leverage expert procedure information without relying on the
privileged tools the expert may have used to perform the procedure, we propose
procedure cloning, which applies supervised sequence prediction to imitate the
series of expert computations. This way, procedure cloning learns not only what
to do (i.e., the output action), but how and why to do it (i.e., the
procedure). Through empirical analysis on navigation, simulated robotic
manipulation, and game-playing environments, we show that imitating the
intermediate computations of an expert's behavior enables procedure cloning to
learn policies exhibiting significant generalization to unseen environment
configurations, including those configurations for which running the expert's
procedure directly is infeasible. |
---|---|
DOI: | 10.48550/arxiv.2205.10816 |
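
The contrast the summary draws between imitating only the output action and imitating the expert's procedure can be illustrated with a small sketch. It is not the authors' implementation: the choice of breadth-first search as the expert, the grid encoding (0 = free, 1 = wall), the token format, and the names `bfs_expert`, `behavior_cloning_target`, and `procedure_cloning_target` are all assumptions made for illustration. The sketch only builds the supervised targets; a sequence model would then be trained to predict them from the observation.

```python
from collections import deque

# Hypothetical 4-connected grid moves (illustrative names, not from the paper).
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def bfs_expert(grid, start, goal):
    """Run BFS on a grid (0 = free, 1 = wall).

    Returns (visit_order, first_action): the cells in the order BFS
    dequeues them, and the first move on a shortest path from start to
    goal (None if the goal is unreachable or equals the start).
    """
    parent = {start: None}
    order = []
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        order.append(cell)
        if cell == goal:
            break
        for name, (dr, dc) in MOVES.items():
            nxt = (cell[0] + dr, cell[1] + dc)
            in_bounds = 0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
            if in_bounds and grid[nxt[0]][nxt[1]] == 0 and nxt not in parent:
                parent[nxt] = (cell, name)
                queue.append(nxt)
    if goal not in parent or goal == start:
        return order, None
    # Walk back from the goal; the last move recorded leaves the start cell.
    cell, action = goal, None
    while parent[cell] is not None:
        cell, action = parent[cell]
    return order, action


def behavior_cloning_target(grid, start, goal):
    """Supervised target that exposes only the expert's output action."""
    _, action = bfs_expert(grid, start, goal)
    return [f"act:{action}"]


def procedure_cloning_target(grid, start, goal):
    """Supervised target that serialises the expert's intermediate
    computations (here, the BFS visit order) followed by the action."""
    order, action = bfs_expert(grid, start, goal)
    return [f"visit:{r},{c}" for r, c in order] + [f"act:{action}"]


if __name__ == "__main__":
    grid = [[0, 0],
            [1, 0]]
    print(behavior_cloning_target(grid, (0, 0), (1, 1)))
    print(procedure_cloning_target(grid, (0, 0), (1, 1)))
```

In this toy setup the expert needs the full grid to run its search, whereas a model trained on the longer target would have to reproduce the search steps itself at inference time, which is the sense in which the summary says the procedure is imitated without the expert's privileged tools.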