The AdEMAMix Optimizer: Better, Faster, Older
Main author: | , , |
Format: | Article |
Language: | eng |
Subjects: | |
Abstract: | Momentum-based optimizers are central to a wide range of machine learning
applications. These typically rely on an Exponential Moving Average (EMA) of
gradients, which exponentially decays the contribution of older gradients. This
reflects the fact that gradients are local linear approximations that lose their
relevance as the iterate moves along the loss landscape. This work questions the
use of a single EMA to accumulate past gradients and empirically demonstrates how
this choice can be sub-optimal: a single EMA cannot simultaneously give a high
weight to the immediate past and a non-negligible weight to older gradients.
Building on this observation, we propose AdEMAMix, a simple modification of the
Adam optimizer with a mixture of two EMAs to better take advantage of past
gradients. Our experiments on language modeling and image classification show --
quite surprisingly -- that gradients can stay relevant for tens of thousands of
steps. They help models converge faster, and often to lower minima: e.g., a
$1.3$B-parameter AdEMAMix LLM trained on $101$B tokens performs comparably to an
AdamW model trained on $197$B tokens ($+95\%$). Moreover, our method
significantly slows down model forgetting during training. Our work motivates
further exploration of different types of functions to leverage past gradients,
beyond EMAs. |
DOI: | 10.48550/arxiv.2409.03137 |
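The abstract only states that AdEMAMix mixes two gradient EMAs inside Adam; it does not spell out the update rule. The sketch below is one minimal interpretation of that idea, assuming the slow EMA (decay `beta3`) is added to the Adam numerator with a mixing coefficient `alpha`; these names, their default values, and the toy quadratic are illustrative assumptions rather than details taken from this record. See the paper (DOI above) for the actual AdEMAMix update.

```python
# Minimal sketch (assumption, not the authors' reference code): an Adam-style
# update that keeps two EMAs of the gradient -- a fast one (decay beta1) and a
# slow one (decay beta3) -- and mixes them in the numerator with weight alpha.
import numpy as np

def two_ema_adam_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                      beta3=0.9999, alpha=5.0, eps=1e-8):
    """One parameter update. `state` carries the fast EMA m1, the slow EMA m2,
    the second-moment EMA v, and the step counter t (updated in place)."""
    state["t"] += 1
    t = state["t"]
    state["m1"] = beta1 * state["m1"] + (1 - beta1) * grad     # fast gradient EMA
    state["m2"] = beta3 * state["m2"] + (1 - beta3) * grad     # slow gradient EMA
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad**2    # squared-gradient EMA
    m1_hat = state["m1"] / (1 - beta1**t)                      # Adam bias correction
    v_hat = state["v"] / (1 - beta2**t)
    # Mixture of the two EMAs: the fast term reacts to the immediate past,
    # while alpha * m2 keeps a non-negligible weight on much older gradients.
    return theta - lr * (m1_hat + alpha * state["m2"]) / (np.sqrt(v_hat) + eps)

# Toy usage on f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.ones(4)
state = {"m1": np.zeros(4), "m2": np.zeros(4), "v": np.zeros(4), "t": 0}
print("initial loss:", 0.5 * np.sum(theta**2))
for _ in range(500):
    grad = theta                      # gradient of 0.5 * ||theta||^2
    theta = two_ema_adam_step(theta, grad, state)
print("final loss:  ", 0.5 * np.sum(theta**2))  # should be noticeably smaller
```

The split mirrors the abstract's argument: a single EMA must trade off weight on the immediate past against weight on older gradients, whereas the fast/slow pair covers both regimes at once.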