High Probability Convergence of Adam Under Unbounded Gradients and Affine Variance Noise
Format: Article
Language: English
Abstract: In this paper, we study the convergence of the Adaptive Moment Estimation (Adam) algorithm for unconstrained non-convex smooth stochastic optimization. Despite its widespread use in machine learning, Adam's theoretical properties remain limited. Prior research primarily investigated Adam's convergence in expectation, often requiring strong assumptions such as uniformly bounded stochastic gradients or a priori problem-dependent knowledge, which constrains the applicability of these findings to practical real-world scenarios. To overcome these limitations, we provide a deep analysis and show that Adam converges to a stationary point with high probability at a rate of $\mathcal{O}\left({\rm poly}(\log T)/\sqrt{T}\right)$ under coordinate-wise "affine" variance noise, without requiring any bounded-gradient assumption or any a priori problem-dependent knowledge to tune hyper-parameters. Additionally, we show that Adam confines its gradients' magnitudes within an order of $\mathcal{O}\left({\rm poly}(\log T)\right)$. Finally, we also investigate a simplified version of Adam without one of the corrective terms and obtain a convergence rate that is adaptive to the noise level.
DOI: 10.48550/arxiv.2311.02000
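For context, the following is a minimal sketch of the Adam iteration and of a coordinate-wise affine variance noise condition as it is commonly written in this line of work; the paper's exact formulation may differ, and the constants $A$ and $B$ below are illustrative names, not taken from the paper.

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, & \hat m_t &= \frac{m_t}{1-\beta_1^t},\\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{\,2}, & \hat v_t &= \frac{v_t}{1-\beta_2^t},\\
x_{t+1} &= x_t - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}, & &
\end{aligned}
$$

where $g_t$ is a stochastic gradient of the objective $f$ at $x_t$, all operations are taken coordinate-wise, and the bias-correction factors $1-\beta_1^t$ and $1-\beta_2^t$ are the corrective terms referred to in the abstract. A coordinate-wise affine variance noise condition of the kind studied here can be stated, for every coordinate $i$, as

$$
\mathbb{E}\!\left[\big(g_{t,i} - \nabla_i f(x_t)\big)^2 \,\middle|\, x_t\right] \;\le\; A + B\,\big(\nabla_i f(x_t)\big)^2,
$$

so the noise variance is allowed to grow with the gradient magnitude rather than being uniformly bounded.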