Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model
Published in: Machine Learning, 2013-06, Vol. 91 (3), pp. 325-349
Main authors:
Format: Article
Language: English
Subjects:
Online access: Full text
Abstract: We consider the problems of learning the optimal action-value function and the optimal policy in discounted-reward Markov decision processes (MDPs). We prove new PAC bounds on the sample complexity of two well-known model-based reinforcement learning (RL) algorithms in the presence of a generative model of the MDP: value iteration and policy iteration. The first result indicates that for an MDP with N state-action pairs and discount factor γ ∈ [0,1), only O(N log(N/δ)/((1−γ)³ε²)) state-transition samples are required to find an ε-optimal estimate of the action-value function with probability (w.p.) 1−δ. Further, we prove that, for small values of ε, an order of O(N log(N/δ)/((1−γ)³ε²)) samples is required to find an ε-optimal policy w.p. 1−δ. We also prove a matching lower bound of Θ(N log(N/δ)/((1−γ)³ε²)) on the sample complexity of estimating the optimal action-value function with ε accuracy. To the best of our knowledge, this is the first minimax result on the sample complexity of RL: the upper bounds match the lower bound in terms of N, ε, δ, and 1/(1−γ) up to a constant factor. Also, both our lower bound and upper bound improve on the state-of-the-art in terms of their dependence on 1/(1−γ).
ISSN: 0885-6125, 1573-0565
DOI: 10.1007/s10994-013-5368-1
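The bounds in the abstract concern model-based algorithms that first estimate the MDP's transition model from generative-model samples and then plan on the estimate. The sketch below is a rough illustration only, not the paper's exact QVI/PI procedures: it draws a per-state-action sample budget of order log(N/δ)/((1−γ)³ε²), as the upper bound suggests with constants omitted, builds an empirical transition model, and runs value iteration on it. The function names, the assumption of known rewards in [0, 1], and the fixed iteration count are illustrative choices, not from the paper.

```python
import numpy as np

def samples_per_state_action(n_state_actions, gamma, eps, delta):
    # Per-(s, a) budget of order log(N/delta) / ((1 - gamma)^3 * eps^2),
    # i.e. the paper's bound divided by N, with constants omitted.
    return int(np.ceil(np.log(n_state_actions / delta) / ((1 - gamma) ** 3 * eps ** 2)))

def model_based_q_iteration(sample_next_state, rewards, n_states, n_actions,
                            gamma, eps, delta, n_iters=200):
    """Estimate Q* from a generative model (illustrative sketch).

    sample_next_state(s, a) -> next state drawn from P(.|s, a)  (the generative model)
    rewards: array of shape (n_states, n_actions), assumed known and in [0, 1]
    """
    N = n_states * n_actions
    m = samples_per_state_action(N, gamma, eps, delta)

    # Empirical transition model from m independent samples per state-action pair.
    P_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(m):
                P_hat[s, a, sample_next_state(s, a)] += 1.0
    P_hat /= m

    # Plain value iteration on the empirical MDP (rewards, P_hat).
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        V = Q.max(axis=1)           # greedy value of the current estimate
        Q = rewards + gamma * P_hat @ V
    return Q

# Tiny two-state, two-action example (made up purely for illustration).
rng = np.random.default_rng(0)
P_true = np.array([[[0.9, 0.1], [0.2, 0.8]],
                   [[0.7, 0.3], [0.05, 0.95]]])
r = np.array([[0.0, 0.1], [0.5, 1.0]])
gen = lambda s, a: rng.choice(2, p=P_true[s, a])
Q_hat = model_based_q_iteration(gen, r, n_states=2, n_actions=2,
                                gamma=0.9, eps=0.5, delta=0.1)
print(Q_hat)
```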