Student-t policy in reinforcement learning to acquire global optimum of robot control
Published in: | Applied intelligence (Dordrecht, Netherlands), 2019-12, Vol.49 (12), p.4335-4347 |
---|---|
Main author: | |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Summary: | This paper proposes an actor-critic algorithm whose policy is parameterized by a Student-t distribution, named the Student-t policy, to enhance learning performance, mainly in terms of reaching the global optimum of the task to be learned. The actor-critic algorithm is one of the policy-gradient methods in reinforcement learning and is proven to learn a policy that converges to one of the local optima. To avoid local optima, two abilities are deemed empirically effective: exploration to escape them and conservative learning to avoid being trapped in them. The conventional policy parameterized by a normal distribution, however, fundamentally lacks these abilities, and state-of-the-art methods compensate for them only partially. In contrast, heavy-tailed distributions, including the Student-t distribution, possess an excellent exploration ability, known as Lévy flight in models of efficient food search by animals. Another property of the heavy tail is robustness to outliers: learning remains conservative, so the policy is not trapped in local optima even when it takes extreme actions. These properties of the Student-t policy increase the likelihood that the agent reaches the global optimum. Indeed, the Student-t policy outperforms the conventional policy in four types of simulations, two of which are difficult to learn quickly without sufficient exploration, while the other two have local optima. |
---|---|
ISSN: | 0924-669X 1573-7497 |
DOI: | 10.1007/s10489-019-01510-8 |
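
Note: The two properties the summary attributes to the heavy tail (wider exploration and robustness to extreme actions) can be illustrated with a small sketch. This is not the paper's implementation. The function names, the location-scale parameterization, and the choice of 2 degrees of freedom are assumptions made only for illustration. The sketch compares how often each distribution draws an action far from its mean, and how strongly such an action would pull the mean through the log-likelihood gradient that policy-gradient methods rely on.

```python
import math
import random


def normal_score_mu(action, mu, sigma):
    """Gradient of log N(action | mu, sigma^2) with respect to mu."""
    return (action - mu) / sigma**2


def student_t_score_mu(action, mu, sigma, nu):
    """Gradient of the location-scale Student-t log-density with respect to mu.

    Unlike the normal case, this is bounded and decays to zero for
    actions far from mu.
    """
    delta = action - mu
    return (nu + 1.0) * delta / (nu * sigma**2 + delta**2)


def sample_student_t(mu, sigma, nu, rng):
    """Draw mu + sigma * Z / sqrt(V / nu), Z ~ N(0, 1), V ~ chi-square(nu)."""
    z = rng.gauss(0.0, 1.0)
    # chi-square(nu) is Gamma(shape=nu/2, scale=2)
    v = rng.gammavariate(nu / 2.0, 2.0)
    return mu + sigma * z / math.sqrt(v / nu)


def tail_fraction(samples, mu, sigma, k=4.0):
    """Fraction of samples farther than k * sigma from mu."""
    return sum(abs(s - mu) > k * sigma for s in samples) / len(samples)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, sigma, nu = 0.0, 1.0, 2.0  # illustrative values, not from the paper
    n = 20000

    normal_actions = [rng.gauss(mu, sigma) for _ in range(n)]
    t_actions = [sample_student_t(mu, sigma, nu, rng) for _ in range(n)]

    # Exploration: how often does each policy try an action beyond 4 sigma?
    print("normal tail fraction:   ", tail_fraction(normal_actions, mu, sigma))
    print("student-t tail fraction:", tail_fraction(t_actions, mu, sigma))

    # Conservative learning: pull on mu from increasingly extreme actions.
    for action in (1.0, 10.0, 100.0):
        print(action,
              normal_score_mu(action, mu, sigma),
              student_t_score_mu(action, mu, sigma, nu))
```

Under these assumptions, the normal score grows linearly with the distance of the action from the mean, whereas the Student-t score peaks near `sqrt(nu) * sigma` and then shrinks, so a single extreme action moves the mean only slightly.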