Cross-layer knowledge distillation with KL divergence and offline ensemble for compressing deep neural network

Deep neural networks (DNN) have solved many tasks, including image classification, object detection, and semantic segmentation. However, when there are huge parameters and high level of computation associated with a DNN model, it becomes difficult to deploy on mobile devices. To address this difficu...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	APSIPA transactions on signal and information processing 2021-11, Vol.10 (1)
Hauptverfasser:	Chou, Hsing-Hung, Chiu, Ching-Te, Liao, Yi-Ping
Format:	Artikel
Sprache:	eng
Schlagworte:	Accuracy Approximation Artificial neural networks Classification Computation Datasets Decomposition Deep learning Distillation Electronic devices Engineering Feature extraction Image classification Image segmentation Internet of Things Knowledge Methods Neural networks Object recognition Original Paper Semantics Signal Processing Teachers Technology
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	Deep neural networks (DNN) have solved many tasks, including image classification, object detection, and semantic segmentation. However, when there are huge parameters and high level of computation associated with a DNN model, it becomes difficult to deploy on mobile devices. To address this difficulty, we propose an efficient compression method that can be split into three parts. First, we propose a cross-layer matrix to extract more features from the teacher's model. Second, we adopt Kullback Leibler (KL) Divergence in an offline environment to make the student model find a wider robust minimum. Finally, we propose the offline ensemble pre-trained teachers to teach a student model. To address dimension mismatch between teacher and student models, we adopt a $1\times 1$ convolution and two-stage knowledge distillation to release this constraint. We conducted experiments with VGG and ResNet models, using the CIFAR-100 dataset. With VGG-11 as the teacher's model and VGG-6 as the student's model, experimental results showed that the Top-1 accuracy increased by 3.57% with a $2.08\times$ compression rate and 3.5x computation rate. With ResNet-32 as the teacher's model and ResNet-8 as the student's model, experimental results showed that Top-1 accuracy increased by 4.38% with a $6.11\times$ compression rate and $5.27\times$ computation rate. In addition, we conducted experiments using the ImageNet$64\times 64$ dataset. With MobileNet-16 as the teacher's model and MobileNet-9 as the student's model, experimental results showed that the Top-1 accuracy increased by 3.98% with a $1.59\times$ compression rate and $2.05\times$ computation rate.
ISSN:	2048-7703 2048-7703
DOI:	10.1017/ATSIP.2021.16