Scaling description of generalization with number of parameters in deep learning
Saved in:

| Published in: | arXiv.org 2019-10 |
|---|---|
| Main authors: | , , , , , , , , |
| Format: | Article |
| Language: | eng |
| Subjects: | |
| Online access: | Full text |
| Summary: | Supervised deep learning involves the training of neural networks with a large number \(N\) of parameters. For large enough \(N\), in the so-called over-parametrized regime, one can essentially fit the training data points. Sparsity-based arguments would suggest that the generalization error increases as \(N\) grows past a certain threshold \(N^{*}\). Instead, empirical studies have shown that in the over-parametrized regime, generalization error keeps decreasing with \(N\). We resolve this paradox through a new framework. We rely on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that the initialization causes finite-size random fluctuations \(\|f_{N}-\bar{f}_{N}\|\sim N^{-1/4}\) of the neural net output function \(f_{N}\) around its expectation \(\bar{f}_{N}\). These affect the generalization error \(\epsilon_{N}\) for classification: under natural assumptions, it decays to a plateau value \(\epsilon_{\infty}\) in a power-law fashion \(\sim N^{-1/2}\). This description breaks down at a so-called jamming transition \(N=N^{*}\). At this threshold, we argue that \(\|f_{N}\|\) diverges. This result leads to a plausible explanation for the cusp in test error known to occur at \(N^{*}\). Our results are confirmed by extensive empirical observations on the MNIST and CIFAR image datasets. Our analysis finally suggests that, given a computational envelope, the smallest generalization error is obtained using several networks of intermediate sizes, just beyond \(N^{*}\), and averaging their outputs. |
| ISSN: | 2331-8422 |
| DOI: | 10.48550/arxiv.1901.01608 |
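
The summary above makes one quantitative claim that is easy to probe numerically: past the jamming transition, the test error is argued to decay toward a plateau as \(\epsilon_{N}\approx\epsilon_{\infty}+cN^{-1/2}\). The Python sketch below is not taken from the paper; it only illustrates, on synthetic placeholder numbers, how such a fit could be carried out on measured test errors by linear least squares in the variable \(N^{-1/2}\).

```python
import numpy as np

# Illustrative sketch only (not code from the paper): generate synthetic test
# errors following the claimed law eps_N = eps_inf + c * N**(-1/2) and recover
# the parameters with a linear least-squares fit. All numbers are placeholders.

rng = np.random.default_rng(0)

# Hypothetical network sizes N (numbers of parameters), assumed to lie beyond
# the jamming threshold N*, where the power-law description is said to hold.
N = np.logspace(4, 7, 12)

# Synthetic test errors: plateau eps_inf plus an N**(-1/2) correction and a
# small amount of noise standing in for measurement scatter.
eps_inf_true, c_true = 0.02, 3.0
eps = eps_inf_true + c_true * N ** -0.5 + rng.normal(0.0, 1e-4, N.size)

# The model is linear once rewritten in x = N**(-1/2): eps = c * x + eps_inf.
x = N ** -0.5
A = np.column_stack([x, np.ones_like(x)])
(c_fit, eps_inf_fit), *_ = np.linalg.lstsq(A, eps, rcond=None)

print(f"fitted plateau error eps_inf ~ {eps_inf_fit:.4f}")
print(f"fitted prefactor c           ~ {c_fit:.3f}")
```

Because the model is linear in \(N^{-1/2}\), a two-parameter least-squares fit recovers both the plateau \(\epsilon_{\infty}\) and the prefactor without any nonlinear optimizer.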