If dropout limits trainable depth, does critical initialisation still matter? A large-scale statistical analysis on ReLU networks


Bibliographic Details
Published in: Pattern Recognition Letters, 2020-10, Vol. 138, pp. 95-105
Authors: Pretorius, Arnu; van Biljon, Elan; van Niekerk, Benjamin; Eloff, Ryan; Reynard, Matthew; James, Steve; Rosman, Benjamin; Kamper, Herman; Kroon, Steve
Format: Article
Language: English
Description
Abstract:

Highlights:
• Recent work has shown that dropout limits the depth to which information can propagate through a neural network.
• We investigate the effect of initialisation on training speed and generalisation within this depth limit.
• We ask specifically: if dropout limits depth, does initialising critically still matter?
• We conduct a large-scale controlled experiment and perform a statistical analysis of over 12 000 trained networks.
• We show that at moderate depths, critical initialisation gives no performance gains over off-critical initialisations.

Recent work in signal propagation theory has shown that dropout limits the depth to which information can propagate through a neural network. In this paper, we investigate the effect of initialisation on training speed and generalisation for ReLU networks within this depth limit. We ask the following research question: given that critical initialisation is crucial for training at large depth, if dropout limits the depth at which networks are trainable, does initialising critically still matter? We conduct a large-scale controlled experiment and perform a statistical analysis of over 12 000 trained networks. We find that (1) trainable networks show no statistically significant difference in performance over a wide range of non-critical initialisations; (2) for initialisations that show a statistically significant difference, the net effect on performance is small; (3) only extreme initialisations (very small or very large) perform worse than criticality. These findings also apply to standard ReLU networks of moderate depth as a special case of zero dropout. Our results therefore suggest that, in the shallow-to-moderate depth setting, critical initialisation provides no performance gains over off-critical initialisations, and that searching for off-critical initialisations that might improve training speed or generalisation is likely to be a fruitless endeavour.
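The abstract does not spell out the criticality condition it refers to. A result commonly cited in the signal-propagation literature for ReLU networks with inverted dropout (keep probability p) is a critical weight variance of σ_w² = 2p, under the convention W_ij ~ N(0, σ_w²/fan_in); this reduces to He initialisation (σ_w² = 2) when p = 1. The sketch below is a minimal NumPy illustration of initialising on and off that assumed criticality and passing data through a ReLU-plus-dropout network; the layer widths, keep probability, and scale factors are illustrative choices, not taken from the paper.

```python
# Minimal sketch (NumPy), assuming the criticality condition
#   sigma_w^2 = 2 * keep_prob   with   W_ij ~ N(0, sigma_w^2 / fan_in),
# which reduces to He initialisation when keep_prob = 1.
# All sizes and scale factors below are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

def init_layers(widths, keep_prob, scale=1.0):
    """Draw weights with variance scale * 2 * keep_prob / fan_in.

    scale = 1.0 gives the (assumed) critical initialisation; other values
    give off-critical initialisations of the kind the paper compares against.
    """
    sigma2 = scale * 2.0 * keep_prob
    return [rng.normal(0.0, np.sqrt(sigma2 / n_in), size=(n_in, n_out))
            for n_in, n_out in zip(widths[:-1], widths[1:])]

def forward(x, weights, keep_prob, train=True):
    """ReLU forward pass with inverted dropout on every hidden layer."""
    h = x
    for i, W in enumerate(weights):
        h = h @ W
        if i < len(weights) - 1:          # hidden layers only
            h = np.maximum(h, 0.0)        # ReLU
            if train:
                mask = rng.binomial(1, keep_prob, size=h.shape) / keep_prob
                h = h * mask              # inverted dropout
    return h

# Compare how signal variance behaves at the output for critical (scale=1.0)
# versus off-critical (scale=0.5, 2.0) initialisations of a 10-hidden-layer net.
widths = [784] + [300] * 10 + [10]
x = rng.normal(size=(128, 784))
for scale in (0.5, 1.0, 2.0):
    W = init_layers(widths, keep_prob=0.8, scale=scale)
    out = forward(x, W, keep_prob=0.8)
    print(f"scale={scale}: output variance {out.var():.3f}")
```

Running such a sketch shows output variance shrinking or growing with depth for off-critical scales and staying roughly stable at the critical scale, which is the behaviour the paper's research question is built around.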
ISSN: 0167-8655, 1872-7344
DOI: 10.1016/j.patrec.2020.06.025