A Methodology Establishing Linear Convergence of Adaptive Gradient Methods under PL Inequality
Format: Article
Language: English
Abstract: Adaptive gradient-descent optimizers are the standard choice for training
neural network models. Despite converging faster than gradient descent and performing
remarkably well in practice, adaptive optimizers are not as well understood as vanilla
gradient descent. One reason is that the dynamic update of the learning rate, which
speeds up convergence, also makes these methods harder to analyze. In particular,
plain gradient descent converges at a linear rate for a class of optimization
problems, whereas the practically faster adaptive gradient methods lack such a
theoretical guarantee. The Polyak-Łojasiewicz (PL) inequality defines the weakest
known class of functions for which linear convergence of gradient descent and its
momentum variants has been proved. In this paper, we therefore prove that AdaGrad and
Adam, two well-known adaptive gradient methods, converge linearly when the cost
function is smooth and satisfies the PL inequality. Our theoretical framework follows
a simple and unified approach, applicable to both batch and stochastic gradients, and
can potentially be used to analyze the linear convergence of other Adam variants.
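
For reference, the conditions named in the abstract can be written in their standard
textbook form. This is a hedged sketch using conventional notation (f, f*, L, μ, x_k),
not the paper's own statement: an L-smooth cost f with optimal value f* that satisfies
the PL inequality admits a linear (geometric) rate for plain gradient descent, and the
paper establishes guarantees of this type for AdaGrad and Adam.

```latex
% Standard formulation (not quoted from the paper): smoothness, the PL
% inequality, and the resulting linear rate for plain gradient descent.
\[
  \|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\|
  \qquad \text{($L$-smoothness)}
\]
\[
  \tfrac{1}{2}\,\|\nabla f(x)\|^{2} \ge \mu\,\bigl(f(x) - f^{*}\bigr)
  \qquad \text{(PL inequality, $\mu > 0$)}
\]
\[
  f(x_{k}) - f^{*} \le \Bigl(1 - \tfrac{\mu}{L}\Bigr)^{k}\bigl(f(x_{0}) - f^{*}\bigr)
  \qquad \text{(gradient descent with step size $1/L$)}
\]
```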
DOI: 10.48550/arxiv.2407.12629
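
To make the abstract's point about dynamic learning-rate updates concrete, below is a
minimal sketch of the AdaGrad and Adam update rules applied to a toy smooth objective
that satisfies the PL inequality. This is the standard textbook rendition, not
pseudocode from the paper; the function names, hyperparameter defaults, and the toy
quadratic are illustrative assumptions.

```python
# Minimal sketch (not the paper's algorithm statement) of the per-coordinate
# learning-rate adaptation in AdaGrad and Adam; names and defaults are assumptions.
import numpy as np

def adagrad_step(x, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad step: scale by the root of the accumulated squared gradients."""
    accum = accum + grad**2                      # running sum of squared gradients
    x = x - lr * grad / (np.sqrt(accum) + eps)   # per-coordinate effective step size
    return x, accum

def adam_step(x, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: bias-corrected moving averages of the gradient and its square."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2        # second-moment estimate
    m_hat = m / (1 - beta1**t)                   # bias correction (t starts at 1)
    v_hat = v / (1 - beta2**t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

# Toy example: f(x) = 0.5 * x^T A x with A positive definite, hence smooth and PL.
A = np.diag([1.0, 10.0])
grad_f = lambda x: A @ x

x, accum = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    x, accum = adagrad_step(x, grad_f(x), accum)

x, m, v = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)
for t in range(1, 101):
    x, m, v = adam_step(x, grad_f(x), m, v, t)
```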