Learning Policies for Markov Decision Processes From Data

Bibliographic Details
Published in: IEEE Transactions on Automatic Control, 2019-06, Vol. 64 (6), pp. 2298-2309
Authors: Hanawal, Manjesh Kumar; Liu, Hao; Zhu, Henghui; Paschalidis, Ioannis Ch.
Format: Article
Language: English
Description
Abstract: We consider the problem of learning a policy for a Markov decision process consistent with data captured on the state-action pairs followed by the policy. We parameterize the policy using features associated with the state-action pairs. The features can be handcrafted or defined using kernel functions in a reproducing kernel Hilbert space. In either case, the set of features can be large and only a small, unknown subset may need to be used to fit a specific policy to the data. The parameters of such a policy are recovered using ℓ1-regularized logistic regression. We establish bounds on the difference between the average reward of the estimated and the unknown original policies (regret) in terms of the generalization error and the ergodic coefficient of the underlying Markov chain. To that end, we combine sample complexity theory and sensitivity analysis of the stationary distribution of Markov chains. Our analysis suggests that to achieve regret within order O(√ε), it suffices to use a training sample size on the order of Ω(log n · poly(1/ε)), where n is the number of features. We demonstrate the effectiveness of our method on a synthetic robot navigation example.
ISSN: 0018-9286, 1558-2523
DOI: 10.1109/TAC.2018.2866455
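
As a minimal sketch of the estimation step described in the abstract, the snippet below fits a policy to observed state-action pairs with ℓ1-regularized logistic regression so that only a small subset of features receives nonzero weight. It is not the authors' implementation: the synthetic feature map, sample sizes, and the regularization constant C are illustrative assumptions, and scikit-learn's generic multiclass logistic regression stands in for the paper's state-action feature parameterization, in which the penalty level is tied to the sample-complexity analysis.

```python
# Illustrative sketch only (not the paper's code): recover a sparse policy from
# observed (state, action) data via L1-regularized logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_features, n_actions = 3000, 50, 4

# Hypothetical state features (handcrafted or kernel-based, per the abstract).
X = rng.normal(size=(n_samples, n_features))

# Ground-truth policy: softmax over actions driven by a sparse weight matrix,
# so only the first 5 features actually influence the chosen actions.
W_true = np.zeros((n_features, n_actions))
W_true[:5] = rng.normal(scale=2.0, size=(5, n_actions))
logits = X @ W_true
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
actions = np.array([rng.choice(n_actions, p=p) for p in probs])

# The L1 penalty encourages selection of the small relevant feature subset.
clf = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000)
clf.fit(X, actions)

selected = np.flatnonzero(np.abs(clf.coef_).sum(axis=0) > 1e-6)
print("features with nonzero weight:", selected)  # ideally close to {0, ..., 4}
print("estimated action probabilities, first state:", clf.predict_proba(X[:1]).round(3))
```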