Heavy Tails in SGD and Compressibility of Overparametrized Neural Networks
Main Authors:
Format: Article
Language: English
Subjects:
Online Access: Order full text
Abstract: Neural network compression techniques have become increasingly popular as
they can drastically reduce the storage and computation requirements for very
large networks. Recent empirical studies have illustrated that even simple
pruning strategies can be surprisingly effective, and several theoretical
studies have shown that compressible networks (in specific senses) should
achieve a low generalization error. Yet, a theoretical characterization of the
underlying cause that makes the networks amenable to such simple compression
schemes is still missing. In this study, we address this fundamental question
and reveal that the dynamics of the training algorithm has a key role in
obtaining such compressible networks. Focusing our attention on stochastic
gradient descent (SGD), our main contribution is to link compressibility to two
recently established properties of SGD: (i) as the network size goes to
infinity, the system can converge to a mean-field limit, where the network
weights behave independently, (ii) for a large step-size/batch-size ratio, the
SGD iterates can converge to a heavy-tailed stationary distribution. In the
case where these two phenomena occur simultaneously, we prove that the networks
are guaranteed to be '$\ell_p$-compressible', and the compression errors of
different pruning techniques (magnitude, singular value, or node pruning)
become arbitrarily small as the network size increases. We further prove
generalization bounds adapted to our theoretical framework, which indeed
confirm that the generalization error will be lower for more compressible
networks. Our theory and numerical study on various neural networks show that
large step-size/batch-size ratios introduce heavy tails, which, in combination
with overparametrization, result in compressibility.
DOI: 10.48550/arxiv.2106.03795
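
The abstract's central mechanism is that heavy-tailed SGD iterates make the weight vector "$\ell_p$-compressible", so magnitude pruning can zero out most coordinates at negligible cost. The following minimal sketch is not the paper's code; it is an illustration under assumed settings (Cauchy samples as a heavy-tailed proxy, a 10% keep ratio, the hypothetical helper `relative_pruning_error`) of the gap between pruning heavy-tailed and light-tailed i.i.d. weights:

```python
# A minimal illustrative sketch (not the paper's code): i.i.d. heavy-tailed weights
# concentrate most of their l2 energy in a few large entries, so magnitude pruning
# discards most coordinates at little cost; light-tailed (Gaussian) weights do not.
# All names, sizes, and ratios below are assumptions chosen for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000        # stand-in for the number of network weights (the theory is asymptotic in n)
keep_ratio = 0.10  # keep only the largest 10% of weights by magnitude

def relative_pruning_error(w: np.ndarray, keep_ratio: float) -> float:
    """Relative l2 error after zeroing all but the largest-magnitude entries of w."""
    k = max(1, int(keep_ratio * w.size))
    smallest = np.argsort(np.abs(w))[:-k]  # indices of the entries to prune
    pruned = w.copy()
    pruned[smallest] = 0.0
    return float(np.linalg.norm(w - pruned) / np.linalg.norm(w))

heavy = rng.standard_cauchy(n)   # heavy-tailed proxy for large step-size/batch-size SGD
light = rng.standard_normal(n)   # light-tailed baseline

print(f"Cauchy   weights, relative pruning error: {relative_pruning_error(heavy, keep_ratio):.3f}")
print(f"Gaussian weights, relative pruning error: {relative_pruning_error(light, keep_ratio):.3f}")
```

In this simulation the heavy-tailed vector typically retains nearly all of its $\ell_2$ norm in the kept 10% of entries, while the Gaussian vector loses a substantial fraction, mirroring the compressibility gap the abstract attributes to large step-size/batch-size ratios combined with overparametrization.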