A Limitation of Gradient Descent Learning
Published in: IEEE Transactions on Neural Networks and Learning Systems, 2020-06, Vol. 31 (6), p. 2227-2232
Main authors: , ,
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: Over the decades, gradient descent has been applied to develop learning algorithms that train neural networks (NNs). In this brief, a limitation of applying such an algorithm to train an NN with persistent weight noise is revealed. Let V(\mathbf{w}) be the performance measure of an ideal NN; V(\mathbf{w}) is used to derive the gradient descent learning (GDL) rule. With weight noise, the desired performance measure, denoted \mathcal{J}(\mathbf{w}), is E[V(\tilde{\mathbf{w}}) \,|\, \mathbf{w}], where \tilde{\mathbf{w}} is the noisy weight vector. When GDL is applied to train an NN with weight noise, the actual learning objective is clearly not V(\mathbf{w}) but another scalar function \mathcal{L}(\mathbf{w}). For decades, there has been a misconception that \mathcal{L}(\mathbf{w}) = \mathcal{J}(\mathbf{w}), and hence that the model attained by GDL is the desired model. However, we show that this might not be the case: 1) with persistent additive weight noise, the model attained is the desired model, since \mathcal{L}(\mathbf{w}) = \mathcal{J}(\mathbf{w}); but 2) with persistent multiplicative weight noise, the model attained is unlikely to be the desired model, since \mathcal{L}(\mathbf{w}) \neq \mathcal{J}(\mathbf{w}). Accordingly, the properties of the attained models are analyzed in comparison with the desired models, and the learning curves are sketched. Simulation results on 1) a simple regression problem and 2) the MNIST handwritten digit recognition task are presented to support our claims.
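The additive-versus-multiplicative distinction claimed in the abstract can be checked on a one-dimensional toy problem; the sketch below is an illustration of that claim, not the paper's experiment. It assumes a toy ideal objective V(w) = (w - 1)^2 and Gaussian weight noise; the names gdl, sigma, and eta are illustrative. With multiplicative noise \tilde{w} = w(1 + \epsilon), the desired objective is \mathcal{J}(w) = E[(\tilde{w} - 1)^2] = (w - 1)^2 + \sigma^2 w^2, whose minimizer is 1/(1 + \sigma^2), yet gradient descent on the noisy weight settles near w = 1 because E[\nabla V(\tilde{w})] = 2(w - 1).

```python
import numpy as np

# Toy illustration of the abstract's claim (assumed setup, not the paper's):
# ideal objective V(w) = (w - 1)^2, Gaussian weight noise with std sigma.
rng = np.random.default_rng(0)
sigma, eta, steps = 0.5, 0.05, 20000

def gdl(noise):
    """Gradient descent where each step's gradient is taken at the noisy weight."""
    w, trace = 0.0, []
    for _ in range(steps):
        eps = rng.normal(0.0, sigma)
        w_tilde = w + eps if noise == "additive" else w * (1.0 + eps)
        w -= eta * 2.0 * (w_tilde - 1.0)  # gradient of V evaluated at w_tilde
        trace.append(w)
    return float(np.mean(trace[steps // 2:]))  # average out the noise floor

print(gdl("additive"))        # settles near 1.0, the minimizer of J(w) = (w-1)^2 + sigma^2
print(gdl("multiplicative"))  # also settles near 1.0, although J(w) = (w-1)^2 + sigma^2 w^2
                              # is minimized at 1/(1 + sigma^2) = 0.8
```

In both cases the GDL fixed point is w = 1; under additive noise this coincides with the minimizer of \mathcal{J}(w), but under multiplicative noise it does not, matching the abstract's \mathcal{L}(\mathbf{w}) \neq \mathcal{J}(\mathbf{w}) claim.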
ISSN: 2162-237X, 2162-2388
DOI: 10.1109/TNNLS.2019.2927689