A Limitation of Gradient Descent Learning

Bibliographic Details
Published in: IEEE Transactions on Neural Networks and Learning Systems, 2020-06, Vol. 31 (6), pp. 2227-2232
Authors: Sum, John; Leung, Chi-Sing; Ho, Kevin
Format: Article
Language: English
Abstract: Over decades, gradient descent has been applied to develop learning algorithms for training a neural network (NN). In this brief, a limitation of applying such algorithms to train an NN with persistent weight noise is revealed. Let $V(\mathbf{w})$ be the performance measure of an ideal NN; $V(\mathbf{w})$ is applied to develop the gradient descent learning (GDL) algorithm. With weight noise, the desired performance measure, denoted $\mathcal{J}(\mathbf{w})$, is $E[V(\tilde{\mathbf{w}}) \mid \mathbf{w}]$, where $\tilde{\mathbf{w}}$ is the noisy weight vector. When GDL is applied to train an NN with weight noise, the actual learning objective is clearly not $V(\mathbf{w})$ but another scalar function $\mathcal{L}(\mathbf{w})$. For decades, there has been a misconception that $\mathcal{L}(\mathbf{w}) = \mathcal{J}(\mathbf{w})$, and hence that the actual model attained by the GDL is the desired model. However, we show that this need not hold: 1) with persistent additive weight noise, the actual model attained is the desired model, since $\mathcal{L}(\mathbf{w}) = \mathcal{J}(\mathbf{w})$; and 2) with persistent multiplicative weight noise, the actual model attained is unlikely to be the desired model, since $\mathcal{L}(\mathbf{w}) \neq \mathcal{J}(\mathbf{w})$. Accordingly, the properties of the attained models, as compared with the desired models, are analyzed and the learning curves are sketched. Simulation results on 1) a simple regression problem and 2) MNIST handwritten digit recognition are presented to support our claims.
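To make the gap between $\mathcal{J}(\mathbf{w})$ and the objective actually optimized concrete, the sketch below is a toy illustration (not the paper's simulation setup or its derivation of $\mathcal{L}(\mathbf{w})$; the data model, noise level sigma_b, and step size are assumptions): a one-parameter linear regressor is trained by gradient descent while additive or multiplicative weight noise is injected at every step. Under additive noise the attained weight sits near the minimizer of $\mathcal{J}(\mathbf{w}) = E[V(\tilde{\mathbf{w}}) \mid \mathbf{w}]$, whereas under multiplicative noise it stays near the minimizer of $V(\mathbf{w})$, away from the minimizer of $\mathcal{J}(\mathbf{w})$, consistent with the abstract's claims.

```python
# Toy sketch (assumed setup, not the paper's experiment): scalar regression
# y ~ w*x trained by gradient descent with persistent weight noise injected
# into the weight at every step.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x plus small observation noise (assumed).
N = 2000
x = rng.normal(size=N)
y = 2.0 * x + 0.1 * rng.normal(size=N)

sigma_b = 0.5   # weight-noise standard deviation (assumed)
mu = 0.01       # learning rate
T = 20000       # number of gradient steps

def grad_V(w):
    """Gradient of the ideal measure V(w) = mean((y - w*x)^2) / 2."""
    return -np.mean((y - w * x) * x)

def run_gdl(noise):
    """Gradient descent with the gradient evaluated at a noisy weight."""
    w, avg = 0.0, 0.0
    for t in range(T):
        b = sigma_b * rng.normal()
        w_noisy = w + b if noise == "additive" else w * (1.0 + b)
        w -= mu * grad_V(w_noisy)
        if t >= T // 2:                  # average the second half to smooth noise
            avg += w / (T - T // 2)
    return avg

# Closed-form minimizers for this quadratic V:
w_V = np.mean(x * y) / np.mean(x * x)    # argmin V(w)
# For multiplicative noise, J(w) = E[V(w(1+b))] = V(w) + (sigma_b^2/2) w^2 mean(x^2),
# so its minimizer is shrunk toward zero:
w_J_mult = np.mean(x * y) / ((1.0 + sigma_b ** 2) * np.mean(x * x))

print("additive:       GDL ~", run_gdl("additive"),
      "  desired argmin J =", w_V)       # J = V + const, so they agree
print("multiplicative: GDL ~", run_gdl("multiplicative"),
      "  desired argmin J =", w_J_mult)  # GDL stays near argmin V instead
```

In this toy case, additive noise only adds a constant to $\mathcal{J}$, so the noisy-gradient iteration and the desired objective share a minimizer; multiplicative noise adds a $w$-dependent penalty to $\mathcal{J}$ that the injected-noise iteration does not see, so the attained weight differs from the desired one.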
ISSN: 2162-237X, 2162-2388
DOI: 10.1109/TNNLS.2019.2927689