Stochastic cubic-regularized policy gradient method

Bibliographic Details
Published in: Knowledge-Based Systems, 2022-11, Vol. 255, p. 109687, Article 109687
Main Authors: Wang, Pengfei; Wang, Hongyu; Zheng, Nenggan
Format: Article
Language: English
Online Access: Full text
Description
Abstract: Policy-based reinforcement learning methods have achieved remarkable success in real-world decision-making problems. However, the theoretical understanding of policy-based methods is still limited. Specifically, existing works mainly focus on first-order stationary point policies (FOSPs); only in some very special reinforcement learning settings (e.g., the tabular case and function approximation with restricted parametric policy classes) do some works consider globally optimal policies. It is well known that FOSPs can be undesirable local optima or saddle points, and obtaining a global optimum is generally NP-hard. In this paper, we propose a policy gradient method that provably converges to second-order stationary point policies (SOSPs) for any differentiable policy class. The proposed method is computationally efficient: it judiciously uses cubic-regularized subroutines to escape saddle points while at the same time minimizing the Hessian-based computations. We prove that the method enjoys a sample complexity of Õ(ϵ^-3.5), which improves upon the previously best-known complexity of Õ(ϵ^-4.5). Finally, experimental results are provided to demonstrate the effectiveness of the method.

•A New Algorithm. We propose a Stochastic Cubic-Regularized Policy Gradient (SCR-PG) method in which the second-order subroutine is invoked only when the iterate arrives near a FOSP. If the iterate is already a SOSP, SCR-PG terminates early and outputs it; otherwise, the subroutine potentially escapes the saddle point. Moreover, the new method is simple to use, since it only leverages stochastic gradient and Hessian-vector product evaluations, both of which are implementable in linear time with respect to the problem dimension, and it only needs to solve the cubic-regularized subproblem approximately rather than exactly. The new method achieves the good properties of previous works (Nesterov and Polyak, 2006; Tripuraneni et al., 2018; Reddi et al., 2018) while avoiding their drawbacks.

•Theoretical Guarantee. We provide a non-asymptotic analysis of SCR-PG's complexity with high probability. We prove that, under mild assumptions, to find an (ϵ, √(ρϵ))-approximate SOSP, where ϵ is a predefined precision accuracy and ρ is the Lipschitz constant of the Hessian of the expected return (see Proposition 1), the sample complexity of SCR-PG is at most Õ(ϵ^-3.5). Note that this complexity is better than the O(ϵ^-4) complexity of REINFORCE for finding FOSPs and the best-known complexity Õ(ϵ^-4.5) for finding SOSPs proposed in Ya…
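Since this record only summarizes the method, the following is a minimal, hypothetical NumPy sketch of the escape logic the abstract describes: take ordinary stochastic gradient steps while the gradient is large, and invoke a cubic-regularized subroutine, driven only by Hessian-vector products, once the iterate is near a FOSP; if that subroutine finds too little model decrease, the iterate is declared an approximate SOSP, otherwise the cubic step escapes the saddle point. The synthetic objective, the gradient and Hessian-vector oracles, and all hyperparameter values below are illustrative stand-ins, not the authors' SCR-PG implementation, where these quantities would be stochastic estimates built from sampled trajectories of a parametrized policy.

```python
# Illustrative sketch only: a cubic-regularized escape loop on a toy objective.
# In SCR-PG the oracles below would be trajectory-based stochastic estimates.
import numpy as np

rho = 1.0          # assumed Hessian-Lipschitz constant
eps = 1e-3         # target accuracy; goal is an (eps, sqrt(rho*eps))-approximate SOSP
max_iters = 10_000
step_size = 0.05

def objective(theta):
    # Hypothetical smooth nonconvex surrogate for the (negated) expected return;
    # it has a saddle point at the origin.
    return 0.5 * theta[0] ** 2 - 0.25 * theta[1] ** 2 + 0.25 * theta[1] ** 4

def grad_estimate(theta):
    # Stand-in for a stochastic policy-gradient estimate (REINFORCE-style).
    g = np.array([theta[0], -0.5 * theta[1] + theta[1] ** 3])
    return g + 1e-4 * np.random.randn(2)   # small noise mimics sampling error

def hvp_estimate(theta, v):
    # Stand-in for a stochastic Hessian-vector product (linear time in the dimension).
    H = np.array([[1.0, 0.0], [0.0, -0.5 + 3.0 * theta[1] ** 2]])
    return H @ v

def cubic_subroutine(theta, g, n_inner=200, lr=0.05):
    # Approximately minimize the cubic-regularized model
    #   m(s) = g^T s + 0.5 * s^T H s + (rho / 6) * ||s||^3
    # by gradient descent, using only Hessian-vector products.
    s = 1e-3 * np.random.randn(*theta.shape)   # random init helps pick up negative curvature
    for _ in range(n_inner):
        grad_m = g + hvp_estimate(theta, s) + 0.5 * rho * np.linalg.norm(s) * s
        s -= lr * grad_m
    decrease = -(g @ s + 0.5 * s @ hvp_estimate(theta, s)
                 + rho / 6.0 * np.linalg.norm(s) ** 3)
    return s, decrease

theta = np.array([0.01, 0.01])              # start near the saddle point
for t in range(max_iters):
    g = grad_estimate(theta)
    if np.linalg.norm(g) > eps:
        theta -= step_size * g              # ordinary stochastic gradient step
        continue
    # Near a first-order stationary point: invoke the second-order subroutine.
    s, decrease = cubic_subroutine(theta, g)
    if decrease < np.sqrt(eps ** 3 / rho):  # too little progress => approximate SOSP
        print(f"approximate SOSP reached at iteration {t}: theta = {theta}")
        break
    theta += s                              # otherwise escape the saddle point
```

The sufficient-decrease test against √(ϵ³/ρ) mirrors the standard termination rule for cubic regularization; the exact thresholds and constants used by SCR-PG may differ from this sketch.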
ISSN:0950-7051
1872-7409
DOI:10.1016/j.knosys.2022.109687