The Value of Reward Lookahead in Reinforcement Learning
Format: Article
Language: English
Abstract: In reinforcement learning (RL), agents sequentially interact with changing environments while aiming to maximize the obtained rewards. Usually, rewards are observed only after acting, and so the goal is to maximize the expected cumulative reward. Yet, in many practical settings, reward information is observed in advance: prices are observed before performing transactions, nearby traffic information is partially known, and goals are often given to agents prior to the interaction. In this work, we aim to quantifiably analyze the value of such future reward information through the lens of competitive analysis. In particular, we measure the ratio between the value of standard RL agents and that of agents with partial future-reward lookahead. We characterize the worst-case reward distribution and derive exact ratios for the worst-case reward expectations. Surprisingly, the resulting ratios relate to known quantities in offline RL and reward-free exploration. We further provide tight bounds for the ratio under worst-case dynamics. Our results cover the full spectrum, from observing the immediate rewards before acting to observing all the rewards before the interaction starts.
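As a rough illustration of the quantity the abstract refers to (a hedged sketch using assumed notation, not necessarily the paper's own), the competitive ratio can be written as the worst-case ratio, over reward distributions, between the optimal value attainable without reward lookahead and the optimal value attainable with lookahead:

\[
  \rho_L \;=\; \inf_{\mathcal{R}} \;
  \frac{\displaystyle \max_{\pi \in \Pi}\; \mathbb{E}_{\mathcal{R}}\Big[\textstyle\sum_{t=1}^{H} r_t \;\Big|\; \pi\Big]}
       {\displaystyle \max_{\pi' \in \Pi_L}\; \mathbb{E}_{\mathcal{R}}\Big[\textstyle\sum_{t=1}^{H} r_t \;\Big|\; \pi'\Big]},
\]

where \(\Pi\) denotes standard policies that see each reward only after acting, \(\Pi_L\) denotes policies that additionally observe the next \(L\) rewards before acting (with \(L = 1\) corresponding to seeing the immediate reward before acting and \(L = H\) to seeing all rewards before a horizon-\(H\) interaction starts), and the infimum ranges over reward distributions \(\mathcal{R}\), matching the worst-case reward distribution mentioned in the abstract. The symbols \(\rho_L\), \(\Pi_L\), \(H\), and \(\mathcal{R}\) are assumptions made for this sketch and may differ from the paper's definitions.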
DOI: 10.48550/arxiv.2403.11637