Training Scale-Invariant Neural Networks on the Sphere Can Happen in Three Regimes
Main Authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Summary: | A fundamental property of deep learning normalization techniques, such as
batch normalization, is making the pre-normalization parameters scale
invariant. The intrinsic domain of such parameters is the unit sphere, and
therefore their gradient optimization dynamics can be represented via spherical
optimization with varying effective learning rate (ELR), which was studied
previously. However, the varying ELR may obscure certain characteristics of the
intrinsic loss landscape structure. In this work, we investigate the properties
of training scale-invariant neural networks directly on the sphere using a
fixed ELR. We discover three regimes of such training depending on the ELR
value: convergence, chaotic equilibrium, and divergence. We study these regimes
in detail through both a theoretical examination of a toy example and a thorough
empirical analysis of real scale-invariant deep learning models. Each regime
has unique features and reflects specific properties of the intrinsic loss
landscape, some of which have strong parallels with previous research on both
regular and scale-invariant neural network training. Finally, we demonstrate
how the discovered regimes are reflected in conventional training of normalized
networks and how they can be leveraged to achieve better optima. |
---|---|
DOI: | 10.48550/arxiv.2209.03695 |
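
The summary describes optimizing scale-invariant parameters directly on the unit sphere with a fixed effective learning rate (ELR). A minimal sketch of one way to do this is a gradient step followed by renormalization. This is not the authors' implementation: the quadratic toy loss, the deterministic full-batch gradient, and the names `loss`, `grad`, `normalize`, and `sphere_gd` are all hypothetical and chosen only for illustration.

```python
import math

def loss(w, a):
    """Toy scale-invariant loss (assumed): 0.5 * sum(a_i * w_i^2) / ||w||^2."""
    n2 = sum(x * x for x in w)
    return 0.5 * sum(ai * x * x for ai, x in zip(a, w)) / n2

def grad(w, a):
    """Gradient of the toy loss; it is orthogonal to w because the loss is scale invariant."""
    n2 = sum(x * x for x in w)
    rayleigh = sum(ai * x * x for ai, x in zip(a, w)) / n2
    return [(ai - rayleigh) * x / n2 for ai, x in zip(a, w)]

def normalize(w):
    """Project a nonzero vector back onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w]

def sphere_gd(w0, a, elr, steps):
    """Gradient descent on the unit sphere with a fixed step size (the ELR)."""
    w = normalize(w0)
    losses = []
    for _ in range(steps):
        g = grad(w, a)
        # Take a gradient step, then renormalize so that ||w|| stays 1.
        w = normalize([x - elr * gx for x, gx in zip(w, g)])
        losses.append(loss(w, a))
    return w, losses

if __name__ == "__main__":
    a = [1.0, 2.0, 5.0]  # hypothetical curvature coefficients
    w, losses = sphere_gd([1.0, 1.0, 1.0], a, elr=0.1, steps=500)
    print(losses[0], losses[-1])
```

Because the weight norm is held at 1, the step size `elr` is the only knob controlling the dynamics. This deterministic toy only illustrates the update rule; it does not attempt to reproduce the three regimes reported in the summary.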