Discovering Temporally-Aware Reinforcement Learning Algorithms
Main authors: | , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Abstract: | Recent advancements in meta-learning have enabled the automatic discovery of
novel reinforcement learning algorithms parameterized by surrogate objective
functions. To improve upon manually designed algorithms, the parameterization
of this learned objective function must be expressive enough to represent novel
principles of learning (instead of merely recovering already established ones)
while still generalizing to a wide range of settings outside of its
meta-training distribution. However, existing methods focus on discovering
objective functions that, like many widely used objective functions in
reinforcement learning, do not take into account the total number of steps
allowed for training, or "training horizon". In contrast, humans use a plethora
of different learning objectives over the course of acquiring a new ability.
For instance, students may alter their studying techniques based on the
proximity to exam deadlines and their self-assessed capabilities. This paper
contends that ignoring the optimization time horizon significantly restricts
the expressive potential of discovered learning algorithms. We propose a simple
augmentation to two existing objective discovery approaches that allows the
discovered algorithm to dynamically update its objective function throughout
the agent's training procedure, resulting in expressive schedules and increased
generalization across different training horizons. In the process, we find that
commonly used meta-gradient approaches fail to discover such adaptive objective
functions while evolution strategies discover highly dynamic learning rules. We
demonstrate the effectiveness of our approach on a wide range of tasks and
analyze the resulting learned algorithms, which we find effectively balance
exploration and exploitation by modifying the structure of their learning rules
throughout the agent's lifetime. |
---|---|
DOI: | 10.48550/arxiv.2402.05828 |
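The abstract's central idea is that the meta-learned surrogate objective should receive the normalized training progress (elapsed steps divided by the total training horizon) as an additional input, and that its meta-parameters are optimized with evolution strategies rather than meta-gradients. The following is a minimal illustrative sketch of that idea, not the authors' implementation: the choice of input features (advantage, probability ratio, value error, progress t/T), the network size, and the names `learned_objective` and `es_meta_step` are assumptions made for illustration.

```python
# Sketch (assumed, not the paper's code): a learned per-sample objective that is
# conditioned on normalized training progress t/T, plus one evolution-strategies
# step on its meta-parameters.
import numpy as np

rng = np.random.default_rng(0)

def init_objective_params(n_inputs=4, hidden=16):
    # Small MLP producing the surrogate loss; inputs are assumed to be
    # [advantage, log-prob ratio, value error, progress t/T].
    return {
        "w1": rng.normal(scale=0.1, size=(n_inputs, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(scale=0.1, size=(hidden, 1)),
        "b2": np.zeros(1),
    }

def learned_objective(params, advantage, ratio, value_err, progress):
    # Per-batch loss from the meta-learned objective network; "progress" is the
    # fraction of the training horizon already used, so the rule can change
    # over the agent's lifetime.
    x = np.stack([advantage, ratio, value_err,
                  np.full_like(advantage, progress)], axis=-1)
    h = np.tanh(x @ params["w1"] + params["b1"])
    return float((h @ params["w2"] + params["b2"]).mean())

def es_meta_step(params, fitness_fn, pop_size=32, sigma=0.02, lr=0.01):
    # One evolution-strategies update on the flattened meta-parameters:
    # sample antithetic perturbations and weight them by their fitness
    # (e.g. the final return of an agent trained with the perturbed objective).
    flat = np.concatenate([p.ravel() for p in params.values()])
    eps = rng.normal(size=(pop_size // 2, flat.size))
    eps = np.concatenate([eps, -eps])            # antithetic sampling
    fitness = np.array([fitness_fn(flat + sigma * e) for e in eps])
    fitness = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
    grad = (fitness[:, None] * eps).mean(axis=0) / sigma
    flat = flat + lr * grad
    # Unflatten back into the parameter dictionary.
    out, i = {}, 0
    for k, p in params.items():
        out[k] = flat[i:i + p.size].reshape(p.shape)
        i += p.size
    return out

# Toy usage with a placeholder fitness; in the paper's setting, fitness would be
# the return achieved by an agent trained for its full horizon with this objective.
params = init_objective_params()
params = es_meta_step(params, fitness_fn=lambda theta: -float(np.sum(theta ** 2)))
```

Because the progress input is part of the objective itself, the same meta-parameters can express different learning rules early and late in training, which is the property the abstract describes as balancing exploration and exploitation across the agent's lifetime.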