Lookahead Optimizer: k steps forward, 1 step back
Saved in:

| Field | Value |
|---|---|
| Main authors | |
| Format | Article |
| Language | English |
| Subjects | |
| Online access | Order full text |
Summary:
The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate schemes, such as AdaGrad and Adam, and (2) accelerated schemes, such as heavy-ball and Nesterov momentum. In this paper, we propose a new optimization algorithm, Lookahead, that is orthogonal to these previous approaches and iteratively updates two sets of weights. Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of fast weights generated by another optimizer. We show that Lookahead improves the learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost. We empirically demonstrate Lookahead can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings on ImageNet, CIFAR-10/100, neural machine translation, and Penn Treebank.
DOI: 10.48550/arxiv.1907.08610
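
The summary describes Lookahead's interaction between two sets of weights: an inner ("fast") optimizer takes k steps, after which the "slow" weights are moved a fraction α toward the final fast weights and the fast weights are reset from the slow ones. A minimal sketch of that update rule follows, assuming plain SGD as the inner optimizer on a toy quadratic loss; the function names, the loss, and the values k=5 and α=0.5 are illustrative choices, not taken from the paper.

```python
# Sketch of the Lookahead slow/fast weight update on a toy problem.
import numpy as np

def grad(theta):
    # Gradient of the toy loss f(theta) = 0.5 * ||theta||^2 (minimum at 0).
    return theta

def lookahead_sgd(theta0, inner_lr=0.1, alpha=0.5, k=5, outer_steps=20):
    slow = theta0.copy()              # slow weights (phi)
    for _ in range(outer_steps):
        fast = slow.copy()            # fast weights start from the slow weights
        for _ in range(k):            # k steps of the inner optimizer (plain SGD here)
            fast -= inner_lr * grad(fast)
        # Slow-weight update: interpolate a fraction alpha toward the final fast weights.
        slow += alpha * (fast - slow)
    return slow

if __name__ == "__main__":
    theta = np.array([5.0, -3.0])
    print(lookahead_sgd(theta))       # approaches the minimum at the origin
```

In this sketch the inner SGD loop could be swapped for any other optimizer (e.g. Adam) without changing the slow-weight interpolation step, which is the sense in which the abstract calls Lookahead orthogonal to adaptive and accelerated schemes.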