Concurrent Training and Layer Pruning of Deep Neural Networks
Saved in:
Main Authors:
Format: Article
Language: English
Subjects:
Online Access: Order full text
Summary: We propose an algorithm capable of identifying and eliminating irrelevant layers of a neural network during the early stages of training. In contrast to weight- or filter-level pruning, layer pruning reduces the harder-to-parallelize sequential computation of a neural network. We employ a structure using residual connections around nonlinear network sections, which allows information to keep flowing through the network once a nonlinear section is pruned. Our approach is based on variational inference principles using Gaussian scale mixture priors on the neural network weights and allows for substantial cost savings during both training and inference. More specifically, the variational posterior distribution of the scalar Bernoulli random variables multiplying the layer weight matrices of the network's nonlinear sections is learned, similarly to adaptive layer-wise dropout. To overcome challenges of concurrent learning and pruning, such as premature pruning and lack of robustness with respect to weight initialization or the size of the starting network, we adopt the "flattening" hyper-prior on the prior parameters. We prove that, as a result of its usage, the solutions of the resulting optimization problem describe deterministic networks with the parameters of the posterior distribution at either 0 or 1. We formulate a projected SGD algorithm and prove its convergence to such a solution using stochastic approximation results. In particular, we prove conditions that lead to a layer's weights converging to zero and derive practical pruning conditions from the theoretical results. The proposed algorithm is evaluated on the MNIST, CIFAR-10 and ImageNet datasets and the common LeNet, VGG16 and ResNet architectures. The simulations demonstrate that our method achieves state-of-the-art performance for layer pruning at reduced computational cost compared to competing methods, owing to the concurrent training and pruning.
DOI: 10.48550/arxiv.2406.04549
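
The summary describes the mechanism only in prose, so the following is a minimal PyTorch-style sketch of the central idea: each nonlinear residual section is gated by a scalar parameter interpreted as the mean of a Bernoulli posterior, the parameter is projected back to [0, 1] after every gradient step, and the section is pruned once that parameter collapses to 0. All names (GatedResidualSection, theta, prune_threshold), the straight-through sampling, and the stand-in loss are illustrative assumptions; the paper's actual objective is a variational bound with Gaussian scale mixture priors and the flattening hyper-prior, which is omitted here.

```python
# Sketch only: a Bernoulli-gated residual section with a projection / pruning step.
import torch
import torch.nn as nn


class GatedResidualSection(nn.Module):
    """Residual wrapper: output = x + theta * f(x), where theta in [0, 1] plays the
    role of a Bernoulli posterior mean gating the nonlinear section f. If theta is
    driven to 0, the section reduces to the identity and can be removed."""

    def __init__(self, body: nn.Module, theta_init: float = 0.9):
        super().__init__()
        self.body = body
        # Scalar gate (illustrative stand-in for the paper's posterior parameter).
        self.theta = nn.Parameter(torch.tensor(theta_init))
        self.pruned = False

    def forward(self, x):
        if self.pruned:
            return x  # identity: information still flows through the skip connection
        if self.training:
            # Sample the Bernoulli gate with a straight-through estimator so that
            # gradients reach theta; at test time use the (0/1-converged) mean.
            z_hard = torch.bernoulli(self.theta.clamp(0.0, 1.0).detach())
            z = z_hard + self.theta - self.theta.detach()
        else:
            z = self.theta
        return x + z * self.body(x)

    @torch.no_grad()
    def project_and_maybe_prune(self, prune_threshold: float = 1e-3):
        # Projection step of a projected-SGD update: keep theta in [0, 1].
        self.theta.clamp_(0.0, 1.0)
        # Illustrative practical pruning condition: drop the section once the gate
        # has effectively collapsed to 0.
        if self.theta.item() < prune_threshold:
            self.pruned = True


# Toy usage: one gated section trained on random data, with the projection and
# pruning check applied after every optimizer step.
if __name__ == "__main__":
    section = GatedResidualSection(
        nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
    )
    opt = torch.optim.SGD(section.parameters(), lr=0.1)
    for _ in range(100):
        x = torch.randn(32, 16)
        loss = section(x).pow(2).mean()  # stand-in loss, not the paper's objective
        opt.zero_grad()
        loss.backward()
        opt.step()
        section.project_and_maybe_prune()
        if section.pruned:
            break  # the section is now an identity map; stop updating it
    print("theta =", section.theta.item(), "pruned =", section.pruned)
```

The skip connection is what lets information continue to flow after a section is pruned, and the clamp to [0, 1] stands in for the projection step of the projected SGD algorithm described in the summary; how the gate is actually sampled, regularized, and proven to converge to 0 or 1 is specified in the paper itself.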