Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks
Main authors: | |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Order full text |
Summary: | We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation, and language modeling, it performs on par with or better than well-tuned SGD with momentum and Adam or AdamW. Additionally, NovoGrad (1) is robust to the choice of learning rate and weight initialization, (2) works well in a large-batch setting, and (3) has half the memory footprint of Adam. |
DOI: | 10.48550/arxiv.1905.11286 |
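The summary names the method's two ingredients, layer-wise gradient normalization and decoupled weight decay, and notes a per-layer second moment that roughly halves the optimizer state relative to Adam. Below is a minimal NumPy sketch of one update step consistent with that description; the function name, default hyperparameter values, and first-step initialization are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def novograd_step(params, grads, moments, v, lr=0.01, beta1=0.95,
                  beta2=0.98, weight_decay=0.001, eps=1e-8):
    """One NovoGrad-style update over a list of per-layer parameter arrays.

    `moments` holds one first-moment array per layer; `v` holds one
    second-moment scalar per layer. Storing a scalar instead of a full
    array per layer is what makes the state roughly half of Adam's.
    Default hyperparameter values here are illustrative assumptions.
    """
    for i, (w, g) in enumerate(zip(params, grads)):
        g_norm_sq = float(np.sum(g * g))  # squared L2 norm of the layer gradient
        if v[i] == 0.0:
            v[i] = g_norm_sq  # assumed first-step initialization
        else:
            v[i] = beta2 * v[i] + (1.0 - beta2) * g_norm_sq
        # Layer-wise normalization: divide the gradient by the layer's
        # gradient norm estimate. Weight decay is added after the
        # normalization, so the decay term is not rescaled by the
        # adaptive denominator (decoupled, in the AdamW sense).
        update = g / (np.sqrt(v[i]) + eps) + weight_decay * w
        moments[i] = beta1 * moments[i] + update
        w -= lr * moments[i]  # in-place parameter update
    return params, moments, v
```

In a training loop this would be called once per step, with `moments` initialized to zero arrays and `v` to zero scalars; a production version would normally live inside an optimizer class rather than a free function.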