A Unifying Framework for Action-Conditional Self-Predictive Reinforcement Learning
Main Authors: |  |
Format: | Article |
Language: | eng |
Subjects: |  |
Online Access: | Order full text |
Summary: | Learning a good representation is a crucial challenge for Reinforcement Learning (RL) agents. Self-predictive learning provides a means to jointly learn a latent representation and a dynamics model by bootstrapping from future latent representations (BYOL). Recent work has developed theoretical insights into these algorithms by studying a continuous-time ODE model for self-predictive representation learning under the simplifying assumption that the algorithm depends on a fixed policy (BYOL-$\Pi$); this assumption is at odds with practical instantiations of such algorithms, which explicitly condition their predictions on future actions. In this work, we take a step towards bridging the gap between theory and practice by analyzing an action-conditional self-predictive objective (BYOL-AC) using the ODE framework, characterizing its convergence properties and highlighting important distinctions between the limiting solutions of the BYOL-$\Pi$ and BYOL-AC dynamics. We show how the two representations are related by a variance equation. This connection leads to a novel variance-like action-conditional objective (BYOL-VAR) and its corresponding ODE. We unify the study of all three objectives through two complementary lenses: a model-based perspective, where each objective is shown to be equivalent to a low-rank approximation of certain dynamics, and a model-free perspective, which establishes relationships between the objectives and their respective value, Q-value, and advantage functions. Our empirical investigations, encompassing both linear function approximation and Deep RL environments, demonstrate that BYOL-AC is better overall across a variety of settings. |
DOI: | 10.48550/arxiv.2406.02035 |
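
As a companion to the summary above, here is a minimal sketch, assuming a PyTorch-style setup with hypothetical `encoder` and `predictor` modules, of the distinction the abstract draws: an action-independent self-predictive loss (in the spirit of BYOL-$\Pi$) and an action-conditional one (in the spirit of BYOL-AC) both bootstrap from a stop-gradient future latent, but only the latter feeds the taken action into the predictor. The architecture, normalization, and MSE loss are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumptions, not the paper's code): contrasts an
# action-independent self-predictive loss (BYOL-Pi style) with an
# action-conditional one (BYOL-AC style) on a batch of transitions.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, latent_dim = 8, 4, 16

# Shared online encoder that maps observations to latent representations.
encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
# BYOL-Pi-style predictor: next latent from the current latent alone
# (implicitly averaging over the policy's action choice).
predictor_pi = nn.Linear(latent_dim, latent_dim)
# BYOL-AC-style predictor: next latent from the current latent AND the action taken.
predictor_ac = nn.Linear(latent_dim + act_dim, latent_dim)

def byol_pi_loss(obs, next_obs):
    z = encoder(obs)
    with torch.no_grad():                      # bootstrapped target: no gradient flows back
        z_next = encoder(next_obs)
    pred = predictor_pi(z)
    return F.mse_loss(F.normalize(pred, dim=-1), F.normalize(z_next, dim=-1))

def byol_ac_loss(obs, action_onehot, next_obs):
    z = encoder(obs)
    with torch.no_grad():
        z_next = encoder(next_obs)
    pred = predictor_ac(torch.cat([z, action_onehot], dim=-1))
    return F.mse_loss(F.normalize(pred, dim=-1), F.normalize(z_next, dim=-1))

# Random placeholder batch of transitions, just to show the call pattern.
obs = torch.randn(32, obs_dim)
next_obs = torch.randn(32, obs_dim)
actions = F.one_hot(torch.randint(act_dim, (32,)), act_dim).float()
print(byol_pi_loss(obs, next_obs).item(), byol_ac_loss(obs, actions, next_obs).item())
```

The only structural difference between the two losses is the predictor's input, which mirrors the gap between the fixed-policy theoretical model (BYOL-$\Pi$) and the action-conditional objective (BYOL-AC) that the paper analyzes.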