Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation
Format: Article
Language: English
Abstract: We study model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs). In this setting, the agent operates in two phases. In the exploration phase, the agent interacts with the environment and collects samples without observing any reward. In the planning phase, the agent is given a specific reward function and uses the samples collected during the exploration phase to learn a good policy. We propose a new provably efficient algorithm, called UCRL-RFE, under the linear mixture MDP assumption, where the transition probability kernel of the MDP can be parameterized by a linear function over certain feature mappings defined on the triplet of state, action, and next state. We show that to obtain an $\epsilon$-optimal policy for an arbitrary reward function, UCRL-RFE needs to sample at most $\tilde{\mathcal{O}}(H^5 d^2 \epsilon^{-2})$ episodes during the exploration phase, where $H$ is the length of the episode and $d$ is the dimension of the feature mapping. We also propose a variant of UCRL-RFE using a Bernstein-type bonus and show that it needs to sample at most $\tilde{\mathcal{O}}(H^4 d(H + d) \epsilon^{-2})$ episodes to achieve an $\epsilon$-optimal policy. By constructing a special class of linear mixture MDPs, we also prove that any reward-free algorithm needs to sample at least $\tilde{\Omega}(H^2 d \epsilon^{-2})$ episodes to obtain an $\epsilon$-optimal policy. Our upper bound matches the lower bound in the dependence on $\epsilon$, and in the dependence on $d$ when $H \ge d$.
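For reference, the linear mixture MDP assumption mentioned in the abstract is commonly formalized as follows. This is a sketch in the notation standard in the linear mixture MDP literature (a known feature map $\phi$ and an unknown parameter vector $\theta^*$); the paper's exact formulation may differ in details such as stage-dependent parameters.

```latex
% Sketch of the linear mixture MDP assumption; notation assumed,
% not copied verbatim from the paper.
P(s' \mid s, a) \;=\; \big\langle \phi(s, a, s'),\, \theta^{*} \big\rangle,
\qquad
\phi : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}^{d},
\quad \theta^{*} \in \mathbb{R}^{d} \ \text{unknown}.
```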
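The two-phase protocol itself can be illustrated with a minimal sketch. Everything below (the `TinyEnv` class, the `explore`/`plan` signatures, the random exploration rule) is hypothetical scaffolding to show the interaction interface; it is not the UCRL-RFE algorithm, which instead drives exploration with optimistic bonuses and fits the linear mixture model.

```python
import random

# Minimal sketch of the reward-free two-phase protocol.
# TinyEnv, explore, and plan are illustrative, NOT the paper's algorithm.

class TinyEnv:
    """A toy tabular episodic environment, only to make the sketch runnable."""
    def __init__(self, num_states=3, num_actions=2):
        self.num_states = num_states
        self.num_actions = num_actions

    def reset(self):
        return 0

    def step(self, state, action):
        # Random transition; note that NO reward is returned,
        # matching the reward-free exploration phase.
        return random.randrange(self.num_states)

def explore(env, num_episodes, H):
    """Exploration phase: collect (s, a, s') triples without any reward signal."""
    dataset = []
    for _ in range(num_episodes):
        s = env.reset()
        for _ in range(H):
            a = random.randrange(env.num_actions)  # placeholder exploration rule
            s_next = env.step(s, a)
            dataset.append((s, a, s_next))
            s = s_next
    return dataset

def plan(dataset, reward_fn, env, H):
    """Planning phase: given a reward function revealed only now, estimate a
    transition model from the dataset and run finite-horizon value iteration."""
    S, A = env.num_states, env.num_actions
    counts = [[[0] * S for _ in range(A)] for _ in range(S)]
    for s, a, s_next in dataset:
        counts[s][a][s_next] += 1

    def p_hat(s, a):
        total = sum(counts[s][a])
        if total == 0:
            return [1.0 / S] * S  # uninformed fallback for unvisited pairs
        return [c / total for c in counts[s][a]]

    # Backward induction on the estimated model.
    V = [0.0] * S
    policy = [[0] * S for _ in range(H)]
    for h in reversed(range(H)):
        V_new = [0.0] * S
        for s in range(S):
            q = [reward_fn(s, a)
                 + sum(p * V[sp] for sp, p in enumerate(p_hat(s, a)))
                 for a in range(A)]
            policy[h][s] = max(range(A), key=lambda a: q[a])
            V_new[s] = max(q)
        V = V_new
    return policy

# Usage: explore once, then plan for a reward function given afterwards.
env = TinyEnv()
data = explore(env, num_episodes=100, H=5)
pi = plan(data, reward_fn=lambda s, a: float(s == 2), env=env, H=5)
```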
DOI: 10.48550/arxiv.2110.06394