Scaling description of generalization with number of parameters in deep learning
Saved in:

| Published in: | arXiv.org 2019-10 |
|---|---|
| Main authors: | , , , , , , , , |
| Format: | Article |
| Language: | eng |
| Subjects: | |
| Online access: | Full text |
| Summary: | Supervised deep learning involves the training of neural networks with a large number \(N\) of parameters. For large enough \(N\), in the so-called over-parametrized regime, one can essentially fit the training data points. Sparsity-based arguments would suggest that the generalization error increases as \(N\) grows past a certain threshold \(N^{*}\). Instead, empirical studies have shown that in the over-parametrized regime, generalization error keeps decreasing with \(N\). We resolve this paradox through a new framework. We rely on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that the initialization causes finite-size random fluctuations \(\|f_{N}-\bar{f}_{N}\|\sim N^{-1/4}\) of the neural net output function \(f_{N}\) around its expectation \(\bar{f}_{N}\). These affect the generalization error \(\epsilon_{N}\) for classification: under natural assumptions, it decays to a plateau value \(\epsilon_{\infty}\) in a power-law fashion \(\sim N^{-1/2}\). This description breaks down at a so-called jamming transition \(N=N^{*}\). At this threshold, we argue that \(\|f_{N}\|\) diverges. This result leads to a plausible explanation for the cusp in test error known to occur at \(N^{*}\). Our results are confirmed by extensive empirical observations on the MNIST and CIFAR image datasets. Our analysis finally suggests that, given a computational envelope, the smallest generalization error is obtained using several networks of intermediate sizes, just beyond \(N^{*}\), and averaging their outputs. |
| ISSN: | 2331-8422 |
| DOI: | 10.48550/arxiv.1901.01608 |
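
The summary above makes one quantitative claim that is easy to probe numerically: past the jamming transition, the test error is argued to decay toward a plateau as \(\epsilon_{N}\approx\epsilon_{\infty}+cN^{-1/2}\). The Python sketch below is not taken from the paper; it only illustrates, on synthetic placeholder numbers, how such a fit could be carried out on measured test errors by linear least squares in the variable \(N^{-1/2}\).

```python
import numpy as np

# Illustrative sketch only (not code from the paper): generate synthetic test
# errors following the claimed law eps_N = eps_inf + c * N**(-1/2) and recover
# the parameters with a linear least-squares fit. All numbers are placeholders.

rng = np.random.default_rng(0)

# Hypothetical network sizes N (numbers of parameters), assumed to lie beyond
# the jamming threshold N*, where the power-law description is said to hold.
N = np.logspace(4, 7, 12)

# Synthetic test errors: plateau eps_inf plus an N**(-1/2) correction and a
# small amount of noise standing in for measurement scatter.
eps_inf_true, c_true = 0.02, 3.0
eps = eps_inf_true + c_true * N ** -0.5 + rng.normal(0.0, 1e-4, N.size)

# The model is linear once rewritten in x = N**(-1/2): eps = c * x + eps_inf.
x = N ** -0.5
A = np.column_stack([x, np.ones_like(x)])
(c_fit, eps_inf_fit), *_ = np.linalg.lstsq(A, eps, rcond=None)

print(f"fitted plateau error eps_inf ~ {eps_inf_fit:.4f}")
print(f"fitted prefactor c           ~ {c_fit:.3f}")
```

Because the model is linear in \(N^{-1/2}\), a two-parameter least-squares fit recovers both the plateau \(\epsilon_{\infty}\) and the prefactor without any nonlinear optimizer.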