From Gradient Clipping to Normalization for Heavy Tailed SGD
Format: Article
Language: English
Abstract: Recent empirical evidence indicates that many machine learning applications involve heavy-tailed gradient noise, which challenges the standard assumption of bounded variance in stochastic optimization. Gradient clipping has emerged as a popular tool to handle this heavy-tailed noise, as it achieves good performance in this setting both theoretically and practically. However, our current theoretical understanding of non-convex gradient clipping has three main shortcomings. First, the theory hinges on large, increasing clipping thresholds, which stand in stark contrast to the small constant clipping thresholds employed in practice. Second, clipping thresholds require knowledge of problem-dependent parameters to guarantee convergence. Lastly, even with this knowledge, current sample complexity upper bounds for the method are sub-optimal in nearly all parameters. To address these issues, we study the convergence of Normalized SGD (NSGD). First, we establish a parameter-free sample complexity for NSGD of $\mathcal{O}\left(\varepsilon^{-\frac{2p}{p-1}}\right)$ to find an $\varepsilon$-stationary point. Furthermore, we prove the tightness of this result by providing a matching algorithm-specific lower bound. In the setting where all problem parameters are known, we show this complexity is improved to $\mathcal{O}\left(\varepsilon^{-\frac{3p-2}{p-1}}\right)$, matching the previously known lower bound for all first-order methods in all problem-dependent parameters. Finally, we establish high-probability convergence of NSGD with a mild logarithmic dependence on the failure probability. Our work complements the studies of gradient clipping under heavy-tailed noise, improving the sample complexities of existing algorithms and offering an alternative mechanism to achieve high-probability convergence.
DOI: 10.48550/arxiv.2410.13849
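To make the contrast in the abstract concrete, the following minimal sketch compares a clipped SGD step with a normalized SGD (NSGD) step on a toy quadratic objective under heavy-tailed gradient noise. This is not the paper's exact algorithm or analysis; the step size `lr`, clipping threshold `tau`, objective, and Student-t noise model are illustrative assumptions.

```python
# Minimal sketch (not the paper's exact algorithms): clipped SGD vs. normalized
# SGD (NSGD) on a toy quadratic with heavy-tailed gradient noise. The step size
# `lr`, clipping threshold `tau`, and Student-t noise model are illustrative
# assumptions, chosen only to show the shape of the two update rules.
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(x):
    # Gradient of f(x) = 0.5 * ||x||^2 corrupted by heavy-tailed noise
    # (Student-t with df < 2 has infinite variance).
    return x + rng.standard_t(df=1.5, size=x.shape)

def clipped_sgd_step(x, lr=0.01, tau=1.0):
    # Shrink the gradient only when its norm exceeds the threshold tau.
    g = noisy_grad(x)
    scale = min(1.0, tau / (np.linalg.norm(g) + 1e-12))
    return x - lr * scale * g

def nsgd_step(x, lr=0.01):
    # Always take a step of length lr in the gradient direction,
    # removing the need to choose a clipping threshold.
    g = noisy_grad(x)
    return x - lr * g / (np.linalg.norm(g) + 1e-12)

x_clip = np.ones(10)
x_norm = np.ones(10)
for _ in range(2000):
    x_clip = clipped_sgd_step(x_clip)
    x_norm = nsgd_step(x_norm)
print("clipped SGD iterate norm:", np.linalg.norm(x_clip))
print("NSGD iterate norm:", np.linalg.norm(x_norm))
```

The only difference between the two steps is the scaling: clipping leaves small gradients untouched and truncates large ones at a threshold `tau`, while normalization rescales every gradient to unit norm, which is what lets the normalized update run without tuning a threshold, in line with the parameter-free guarantee described in the abstract.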