Training Transition Policies via Distribution Matching for Complex Tasks
Main authors: 
Format: Article
Language: English
Subjects: 
Online access: Order full text
Summary: Humans decompose novel complex tasks into simpler ones to exploit previously
learned skills. Analogously, hierarchical reinforcement learning seeks to
leverage lower-level policies for simple tasks to solve complex ones. However,
because each lower-level policy induces a different distribution of states,
transitioning from one lower-level policy to another may fail due to an
unexpected starting state. We introduce transition policies that smoothly
connect lower-level policies by producing a distribution of states and actions
that matches what is expected by the next policy. Training transition policies
is challenging because the natural reward signal -- whether the next policy can
execute its subtask successfully -- is sparse. By training transition policies
via adversarial inverse reinforcement learning to match the distribution of
expected states and actions, we avoid relying on task-based reward. To further
improve performance, we use deep Q-learning with a binary action space to
determine when to switch from a transition policy to the next pre-trained
policy, using the success or failure of the next subtask as the reward.
Although the reward is still sparse, the problem is less severe due to the
simple binary action space. We demonstrate our method on continuous bipedal
locomotion and arm manipulation tasks that require diverse skills. We show that
it smoothly connects the lower-level policies, achieving higher success rates
than previous methods that search for successful trajectories based on a reward
function, but do not match the state distribution.
DOI: 10.48550/arxiv.2110.04357
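
The summary above describes two trainable components. The first trains a transition policy by adversarial distribution matching: a discriminator scores (state, action) pairs against the distribution expected by the next lower-level policy, and its log-ratio output gives the transition policy a dense reward in place of the sparse task reward. The sketch below is only an illustration under assumptions, not the paper's implementation: it assumes PyTorch, and the network sizes, the names `Discriminator` and `matching_reward`, and the exact reward form are placeholders.

```python
# Hedged sketch (not the authors' code): a GAIL/AIRL-style discriminator over
# (state, action) pairs, whose log-ratio output serves as a dense reward that
# pushes the transition policy toward the distribution expected by the next
# lower-level policy. Sizes, names, and the reward form are assumptions.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Classifies (state, action) pairs: expected-by-next-policy vs. transition-policy."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),  # logit: higher = looks like the expected distribution
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))

def discriminator_loss(disc, expert_s, expert_a, trans_s, trans_a):
    """Binary cross-entropy: expected pairs labelled 1, transition-policy pairs labelled 0."""
    bce = nn.BCEWithLogitsLoss()
    expert_logits = disc(expert_s, expert_a)
    trans_logits = disc(trans_s, trans_a)
    return (bce(expert_logits, torch.ones_like(expert_logits))
            + bce(trans_logits, torch.zeros_like(trans_logits)))

def matching_reward(disc, state, action):
    """Dense surrogate reward for the transition policy: high when the discriminator
    believes the pair comes from the distribution the next policy expects."""
    with torch.no_grad():
        d = torch.sigmoid(disc(state, action))
    return torch.log(d + 1e-8) - torch.log(1.0 - d + 1e-8)  # log-ratio style reward
```

Because the discriminator provides a reward at every step, the transition policy can be optimized with any standard policy-gradient method without waiting for the sparse subtask-success signal.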
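The second component is deep Q-learning over a binary action space that decides when to hand control from the transition policy to the next pre-trained policy, rewarded by the eventual success or failure of that subtask. The following is again a minimal sketch under assumptions (PyTorch, epsilon-greedy exploration, a standard one-step TD target); the names `SwitchQNet`, `select_action`, and `dqn_loss` are hypothetical.

```python
# Hedged sketch (assumptions, not the paper's code): a small Q-network over a
# binary action space {0: keep running the transition policy, 1: hand over to
# the next pre-trained policy}, trained with a one-step DQN target. The sparse
# reward (subtask success/failure) arrives only after the switch is made.
import random
import torch
import torch.nn as nn

class SwitchQNet(nn.Module):
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # Q(s, stay), Q(s, switch)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def select_action(qnet, state, epsilon: float) -> int:
    """Epsilon-greedy choice between continuing the transition policy (0) and switching (1)."""
    if random.random() < epsilon:
        return random.randint(0, 1)
    with torch.no_grad():
        return int(qnet(state).argmax().item())

def dqn_loss(qnet, target_qnet, batch, gamma: float = 0.99):
    """One-step TD error on (s, a, r, s_next, done) tensors from a replay buffer."""
    s, a, r, s_next, done = batch
    q = qnet(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_qnet(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * q_next
    return nn.functional.mse_loss(q, target)
```

Because the switching agent only chooses between two actions, the sparse success/failure reward has to guide far fewer decisions than it would in the underlying continuous control problem, which is the reason the summary gives for why sparsity is less severe at this stage.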