A Limitation of Gradient Descent Learning

Bibliographic Details
Published in: IEEE Transactions on Neural Networks and Learning Systems, 2020-06, Vol. 31 (6), pp. 2227-2232
Authors: Sum, John; Leung, Chi-Sing; Ho, Kevin
Format: Article
Language: English
Abstract: Over decades, gradient descent has been applied to develop learning algorithms for training a neural network (NN). In this brief, a limitation of applying such algorithms to train an NN with persistent weight noise is revealed. Let $V(\mathbf{w})$ be the performance measure of an ideal NN; $V(\mathbf{w})$ is applied to develop the gradient descent learning (GDL) algorithm. With weight noise, the desired performance measure, denoted $\mathcal{J}(\mathbf{w})$, is $E[V(\tilde{\mathbf{w}}) \mid \mathbf{w}]$, where $\tilde{\mathbf{w}}$ is the noisy weight vector. When GDL is applied to train an NN with weight noise, the actual learning objective is clearly not $V(\mathbf{w})$ but another scalar function $\mathcal{L}(\mathbf{w})$. For decades, there has been a misconception that $\mathcal{L}(\mathbf{w}) = \mathcal{J}(\mathbf{w})$, and hence that the actual model attained by the GDL is the desired model. However, we show that this need not hold: 1) with persistent additive weight noise, the actual model attained is the desired model, since $\mathcal{L}(\mathbf{w}) = \mathcal{J}(\mathbf{w})$; and 2) with persistent multiplicative weight noise, the actual model attained is unlikely to be the desired model, since $\mathcal{L}(\mathbf{w}) \neq \mathcal{J}(\mathbf{w})$. Accordingly, the properties of the attained models, as compared with the desired models, are analyzed and the learning curves are sketched. Simulation results on 1) a simple regression problem and 2) MNIST handwritten digit recognition are presented to support our claims.
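To make the gap between $\mathcal{J}(\mathbf{w})$ and the objective actually optimized concrete, the sketch below is a toy illustration (not the paper's simulation setup or its derivation of $\mathcal{L}(\mathbf{w})$; the data model, noise level sigma_b, and step size are assumptions): a one-parameter linear regressor is trained by gradient descent while additive or multiplicative weight noise is injected at every step. Under additive noise the attained weight sits near the minimizer of $\mathcal{J}(\mathbf{w}) = E[V(\tilde{\mathbf{w}}) \mid \mathbf{w}]$, whereas under multiplicative noise it stays near the minimizer of $V(\mathbf{w})$, away from the minimizer of $\mathcal{J}(\mathbf{w})$, consistent with the abstract's claims.

```python
# Toy sketch (assumed setup, not the paper's experiment): scalar regression
# y ~ w*x trained by gradient descent with persistent weight noise injected
# into the weight at every step.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x plus small observation noise (assumed).
N = 2000
x = rng.normal(size=N)
y = 2.0 * x + 0.1 * rng.normal(size=N)

sigma_b = 0.5   # weight-noise standard deviation (assumed)
mu = 0.01       # learning rate
T = 20000       # number of gradient steps

def grad_V(w):
    """Gradient of the ideal measure V(w) = mean((y - w*x)^2) / 2."""
    return -np.mean((y - w * x) * x)

def run_gdl(noise):
    """Gradient descent with the gradient evaluated at a noisy weight."""
    w, avg = 0.0, 0.0
    for t in range(T):
        b = sigma_b * rng.normal()
        w_noisy = w + b if noise == "additive" else w * (1.0 + b)
        w -= mu * grad_V(w_noisy)
        if t >= T // 2:                  # average the second half to smooth noise
            avg += w / (T - T // 2)
    return avg

# Closed-form minimizers for this quadratic V:
w_V = np.mean(x * y) / np.mean(x * x)    # argmin V(w)
# For multiplicative noise, J(w) = E[V(w(1+b))] = V(w) + (sigma_b^2/2) w^2 mean(x^2),
# so its minimizer is shrunk toward zero:
w_J_mult = np.mean(x * y) / ((1.0 + sigma_b ** 2) * np.mean(x * x))

print("additive:       GDL ~", run_gdl("additive"),
      "  desired argmin J =", w_V)       # J = V + const, so they agree
print("multiplicative: GDL ~", run_gdl("multiplicative"),
      "  desired argmin J =", w_J_mult)  # GDL stays near argmin V instead
```

In this toy case, additive noise only adds a constant to $\mathcal{J}$, so the noisy-gradient iteration and the desired objective share a minimizer; multiplicative noise adds a $w$-dependent penalty to $\mathcal{J}$ that the injected-noise iteration does not see, so the attained weight differs from the desired one.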
ISSN: 2162-237X, 2162-2388
DOI: 10.1109/TNNLS.2019.2927689