Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Abstract: | We study the Stochastic Shortest Path (SSP) problem with a linear mixture
transition kernel, where an agent repeatedly interacts with a stochastic
environment and seeks to reach a certain goal state while minimizing the
cumulative cost. Existing works often assume a strictly positive lower bound on
the cost function or an upper bound on the expected length of the optimal
policy. In this paper, we propose a new algorithm to eliminate these
restrictive assumptions. Our algorithm is based on extended value iteration
with a fine-grained variance-aware confidence set, where the variance is
estimated recursively from high-order moments. Our algorithm achieves an
$\tilde{\mathcal O}(dB_*\sqrt{K})$ regret bound, where $d$ is the dimension of
the feature mapping in the linear transition kernel, $B_*$ is an upper bound
on the total cumulative cost of the optimal policy, and $K$ is the number of
episodes. Our regret upper bound matches the $\Omega(dB_*\sqrt{K})$ lower bound
of linear mixture SSPs in Min et al. (2022), which suggests that our algorithm
is nearly minimax optimal. |
---|---|
DOI: | 10.48550/arxiv.2402.08998 |