Dichotomy of Control: Separating What You Can Control from What You Cannot
Format: Article
Language: English
Abstract: Future- or return-conditioned supervised learning is an emerging paradigm for offline reinforcement learning (RL), where the future outcome (i.e., return) associated with an observed action sequence is used as input to a policy trained to imitate those same actions. While return-conditioning is at the heart of popular algorithms such as decision transformer (DT), these methods tend to perform poorly in highly stochastic environments, where an occasional high return can arise from randomness in the environment rather than the actions themselves. Such situations can lead to a learned policy that is inconsistent with its conditioning inputs; i.e., using the policy to act in the environment, when conditioning on a specific desired return, leads to a distribution of real returns that is wildly different than desired. In this work, we propose the dichotomy of control (DoC), a future-conditioned supervised learning framework that separates mechanisms within a policy's control (actions) from those beyond a policy's control (environment stochasticity). We achieve this separation by conditioning the policy on a latent variable representation of the future, and designing a mutual information constraint that removes any information from the latent variable associated with randomness in the environment. Theoretically, we show that DoC yields policies that are consistent with their conditioning inputs, ensuring that conditioning a learned policy on a desired high-return future outcome will correctly induce high-return behavior. Empirically, we show that DoC is able to achieve significantly better performance than DT on environments that have highly stochastic rewards and transitions.
DOI: 10.48550/arxiv.2210.13435
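
The abstract describes the core recipe at a high level: imitate the logged actions while conditioning the policy on a latent representation of the future, and constrain the mutual information between that latent and the environment's randomness (rewards and transitions). The sketch below illustrates one way such a training step could look. It is not the authors' implementation: PyTorch, the Gaussian posterior, the KL-to-a-conditional-prior surrogate for the mutual-information constraint, and all module names and hyperparameters (FutureEncoder, Policy, doc_step, beta) are assumptions made for illustration only.

```python
# Hedged sketch of a DoC-style training step (not the paper's released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACT_DIM, LATENT_DIM = 4, 3, 8  # illustrative sizes

class FutureEncoder(nn.Module):
    """q(z | future): encodes a future trajectory segment into a Gaussian latent."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(input_size=STATE_DIM + ACT_DIM + 1,
                          hidden_size=32, batch_first=True)
        self.mu = nn.Linear(32, LATENT_DIM)
        self.log_std = nn.Linear(32, LATENT_DIM)

    def forward(self, future):            # future: [B, T, state+action+reward]
        _, h = self.rnn(future)
        h = h.squeeze(0)                   # [B, 32]
        return self.mu(h), self.log_std(h).clamp(-5, 2)

class Policy(nn.Module):
    """pi(a | s, z): latent-conditioned action logits (discrete actions)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + LATENT_DIM, 64),
                                 nn.ReLU(), nn.Linear(64, ACT_DIM))

    def forward(self, s, z):
        return self.net(torch.cat([s, z], dim=-1))

def doc_step(encoder, policy, prior, batch, beta=1.0):
    """Imitate the observed action conditioned on z, while penalizing
    information in z about environment randomness beyond what (s, a) fix."""
    s, a, future = batch["state"], batch["action"], batch["future"]
    mu, log_std = encoder(future)
    z = mu + log_std.exp() * torch.randn_like(mu)   # reparameterized sample

    # Behavior cloning of the logged action, conditioned on the latent future.
    bc_loss = F.cross_entropy(policy(s, z), a)

    # Variational MI surrogate: KL(q(z | future) || p(z | s, a)), where the
    # prior only sees what the policy controls. One common approximation of
    # the mutual-information constraint; not necessarily the paper's estimator.
    prior_in = torch.cat([s, F.one_hot(a, num_classes=ACT_DIM).float()], dim=-1)
    p_mu, p_log_std = prior(prior_in).chunk(2, dim=-1)
    p_log_std = p_log_std.clamp(-5, 2)
    kl = (p_log_std - log_std
          + (log_std.exp() ** 2 + (mu - p_mu) ** 2) / (2 * p_log_std.exp() ** 2)
          - 0.5).sum(-1).mean()

    return bc_loss + beta * kl

# Usage on a dummy batch of offline data.
encoder, policy = FutureEncoder(), Policy()
prior = nn.Sequential(nn.Linear(STATE_DIM + ACT_DIM, 64), nn.ReLU(),
                      nn.Linear(64, 2 * LATENT_DIM))
batch = {"state": torch.randn(16, STATE_DIM),
         "action": torch.randint(0, ACT_DIM, (16,)),
         "future": torch.randn(16, 10, STATE_DIM + ACT_DIM + 1)}
loss = doc_step(encoder, policy, prior, batch)
loss.backward()
```

At inference time, the abstract's consistency guarantee suggests selecting a latent associated with a high desired future outcome and conditioning the policy on it; how that selection is performed is not described in the abstract and is therefore left out of this sketch.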