How Does Learning Rate Decay Help Modern Neural Networks?
Format: Article
Language: English
Online access: Order full text
Abstract: Learning rate decay (lrDecay) is a *de facto* technique for training
modern neural networks. It starts with a large learning rate and then decays it
multiple times. It is empirically observed to help both optimization and
generalization. Common beliefs in how lrDecay works come from the optimization
analysis of (Stochastic) Gradient Descent: 1) an initially large learning rate
accelerates training or helps the network escape spurious local minima; 2)
decaying the learning rate helps the network converge to a local minimum and
avoid oscillation. Despite the popularity of these common beliefs, experiments
suggest that they are insufficient to explain the general effectiveness of
lrDecay in training modern neural networks that are deep, wide, and nonconvex.
We provide an alternative explanation: an initially large learning rate
prevents the network from memorizing noisy data, while decaying the learning
rate improves the learning of complex patterns. The proposed explanation is
validated on a carefully constructed dataset with tractable pattern complexity.
Its implication, that additional patterns learned in later stages of lrDecay
are more complex and thus less transferable, is justified on real-world
datasets. We believe that this alternative explanation will shed light on the
design of better training strategies for modern neural networks.
DOI: 10.48550/arxiv.1908.01878
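
To make the schedule described in the abstract concrete (an initially large learning rate that is decayed multiple times during training), the sketch below shows a minimal step-wise schedule in plain Python. The function name `step_decay_lr`, the initial rate of 0.1, the decay factor of 0.1, and the milestone epochs 30 and 60 are illustrative assumptions, not values taken from the paper.

```python
def step_decay_lr(epoch, initial_lr=0.1, decay_factor=0.1, milestones=(30, 60)):
    """Piecewise-constant lrDecay: start with a large rate and multiply it by
    decay_factor at each milestone epoch (all values here are illustrative)."""
    lr = initial_lr
    for m in milestones:
        if epoch >= m:
            lr *= decay_factor
    return lr

# Example: the learning rate stays at 0.1 for epochs 0-29,
# drops to 0.01 at epoch 30, and to 0.001 at epoch 60.
for epoch in (0, 29, 30, 59, 60, 90):
    print(epoch, step_decay_lr(epoch))
```

In the abstract's framing, the large-rate phase (epochs before the first milestone in this sketch) is where memorization of noisy data is suppressed, and each decay step lets the network pick up progressively more complex patterns.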