Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model

Bibliographic Details
Published in: Machine Learning, 2013-06, Vol. 91 (3), pp. 325–349
Main authors: Gheshlaghi Azar, Mohammad; Munos, Rémi; Kappen, Hilbert J.
Format: Article
Language: English
Online access: Full text
Description
Abstract: We consider the problems of learning the optimal action-value function and the optimal policy in discounted-reward Markov decision processes (MDPs). We prove new PAC bounds on the sample complexity of two well-known model-based reinforcement learning (RL) algorithms in the presence of a generative model of the MDP: value iteration and policy iteration. The first result indicates that for an MDP with N state-action pairs and discount factor γ ∈ [0,1), only O(N log(N/δ)/((1−γ)³ε²)) state-transition samples are required to find an ε-optimal estimate of the action-value function with probability (w.p.) 1−δ. Further, we prove that, for small values of ε, an order of O(N log(N/δ)/((1−γ)³ε²)) samples is required to find an ε-optimal policy w.p. 1−δ. We also prove a matching lower bound of Θ(N log(N/δ)/((1−γ)³ε²)) on the sample complexity of estimating the optimal action-value function with ε accuracy. To the best of our knowledge, this is the first minimax result on the sample complexity of RL: the upper bounds match the lower bound in terms of N, ε, δ and 1/(1−γ) up to a constant factor. Also, both our lower bound and upper bound improve on the state of the art in terms of their dependence on 1/(1−γ).
ISSN: 0885-6125, 1573-0565
DOI: 10.1007/s10994-013-5368-1
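
The abstract analyzes sampling-based value iteration run on an empirical model built from a generative model of the MDP. The following is a minimal Python sketch of that general scheme, not the authors' algorithm or code: the sampler `sample_next_state`, the per-pair sample budget `m`, the known reward table, and the toy MDP at the bottom are all illustrative assumptions.

```python
import numpy as np

def empirical_qvi(sample_next_state, n_states, n_actions, reward, gamma, m, n_iterations):
    """Model-based Q-value iteration from a generative model (illustrative sketch).

    sample_next_state(s, a) returns a next state drawn from P(.|s, a); it stands
    in for the generative model assumed in the paper. m is the number of
    transition samples drawn per state-action pair, so the total sample budget
    is m * n_states * n_actions.
    """
    # Build the empirical transition model P_hat from m samples per (s, a).
    p_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(m):
                p_hat[s, a, sample_next_state(s, a)] += 1.0
    p_hat /= m

    # Run standard Q-value iteration on the empirical MDP (P_hat, reward).
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iterations):
        v = q.max(axis=1)                  # greedy state values under the current Q
        q = reward + gamma * p_hat.dot(v)  # Bellman optimality backup on the empirical model
    return q

# Hypothetical usage on a 2-state, 2-action MDP (the true kernel P is hidden behind the sampler):
rng = np.random.default_rng(0)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])   # true transition probabilities
R = np.array([[1.0, 0.0], [0.0, 1.0]])     # reward table
sampler = lambda s, a: rng.choice(2, p=P[s, a])
Q_hat = empirical_qvi(sampler, 2, 2, R, gamma=0.9, m=2000, n_iterations=200)
```

In the abstract's notation, N is the number of state-action pairs (here n_states * n_actions), so the total number of transitions drawn is m·N; the paper's upper bound states that a total budget on the order of N log(N/δ)/((1−γ)³ε²) suffices for the resulting action-value estimate to be ε-accurate with probability 1−δ, and the matching lower bound shows this dependence cannot be improved beyond constant factors.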