Leveraging Continuous Time to Understand Momentum When Training Diagonal Linear Networks
Main authors: , ,
Format: Article
Language: English
Subjects:
Online access: Order full text
Summary: In this work, we investigate the effect of momentum on the optimisation trajectory of gradient descent. We leverage a continuous-time approach in the analysis of momentum gradient descent with step size $\gamma$ and momentum parameter $\beta$ that allows us to identify an intrinsic quantity $\lambda = \frac{\gamma}{(1 - \beta)^2}$ which uniquely defines the optimisation path and provides a simple acceleration rule. When training a $2$-layer diagonal linear network in an overparametrised regression setting, we characterise the recovered solution through an implicit regularisation problem. We then prove that small values of $\lambda$ help to recover sparse solutions. Finally, we give similar but weaker results for stochastic momentum gradient descent. We provide numerical experiments which support our claims.
DOI: 10.48550/arxiv.2403.05293
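The abstract's central object is heavy-ball momentum gradient descent with step size $\gamma$ and momentum parameter $\beta$, applied to a $2$-layer diagonal linear network. Below is a minimal NumPy sketch of that setup, showing how the intrinsic quantity $\lambda = \gamma / (1 - \beta)^2$ is computed; the problem sizes, initialisation scale `alpha`, and iteration count are illustrative choices of ours, not values taken from the paper.

```python
import numpy as np

# Heavy-ball momentum GD on a 2-layer diagonal linear network
# f(x) = <u * v, x> in an overparametrised regression setting (n < d).
rng = np.random.default_rng(0)
n, d, k = 20, 100, 3                    # samples, dimension, sparsity (our choices)
X = rng.standard_normal((n, d))
w_star = np.zeros(d)
w_star[:k] = 1.0                        # sparse ground-truth regressor
y = X @ w_star

gamma, beta = 1e-3, 0.9                 # step size and momentum parameter
lam = gamma / (1 - beta) ** 2           # intrinsic quantity from the abstract
print(f"lambda = {lam:.3f}")

alpha = 0.1                             # initialisation scale (our choice)
u = np.full(d, alpha)
v = np.full(d, alpha)
mu = np.zeros(d)                        # momentum buffers
mv = np.zeros(d)

for _ in range(50_000):
    r = X @ (u * v) - y                 # residuals
    gu = (X.T @ r / n) * v              # gradient of the squared loss wrt u
    gv = (X.T @ r / n) * u              # gradient of the squared loss wrt v
    mu = beta * mu + gu                 # heavy-ball velocity updates
    mv = beta * mv + gv
    u -= gamma * mu
    v -= gamma * mv

w = u * v                               # recovered linear predictor
print("recovery error:", np.linalg.norm(w - w_star))
```

Per the abstract, runs sharing the same $\lambda$ follow the same optimisation path (this is what yields the paper's acceleration rule), and smaller values of $\lambda$ favour sparser recovered solutions.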