An adaptive teacher–student learning algorithm with decomposed knowledge distillation for on-edge intelligence
Saved in:
Published in: Engineering Applications of Artificial Intelligence, 2023-01, Vol. 117, p. 105560, Article 105560
Main Authors: , ,
Format: Article
Language: English
Subjects:
Online Access: Full text
Abstract: In feature-based knowledge distillation (KD), when the spatial shape of the teacher's feature maps is significantly larger than that of the student model, two problems arise: first, the feature maps cannot be compared directly; second, the knowledge in these complex feature maps is not readily apprehensible to the student. This paper proposes a new KD method in which Tucker decomposition is used to decompose the teacher's large-dimension feature maps into core tensors. Because of their low complexity, the knowledge in these core tensors can be easily understood by the student. Furthermore, the proposed KD introduces an adaptor function that balances the spatial shapes of the teacher's and student's core tensors and enables their comparison through a convolution regressor. Finally, a hybrid loss based on the adaptor function is proposed to distill the knowledge of the teacher's core tensors to the student. Both teacher and student models were implemented on smartphones used as edge devices, and the experiments were evaluated in terms of recognition rate and complexity. According to the results, the student model, built on the ResNet-18 architecture, has ∼65.44 million fewer parameters, ∼6.45 fewer GFLOPs of computational complexity, ∼1.12 GB less GPU memory use, and a ∼265.67 times greater compression rate than its teacher model, built on the ResNet-50 architecture, while the recognition rate of the student model drops by only 1.5% on the benchmark dataset.
Highlights:
• Introducing teacher–student learning with decomposed knowledge distillation.
• Distillation is based on the decomposition of feature maps within the middle layers.
• Implementing and evaluating the student model on a smartphone as an edge device.
• The results of the student model show competitive performance on benchmark datasets.
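The abstract outlines three ingredients: Tucker decomposition of the teacher's feature maps into low-complexity core tensors, an adaptor (a convolution regressor) that reconciles the spatial shapes of the teacher's and student's core tensors, and a hybrid loss that distills the core-tensor knowledge alongside the usual task loss. The sketch below shows how such a pipeline could be wired together in PyTorch with TensorLy; the rank choices, the 1x1-conv-plus-bilinear adaptor, the loss weight `alpha`, and all names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of decomposed feature distillation as described in the abstract.
# Ranks, adaptor design, and loss weighting are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F
import tensorly as tl
from tensorly.decomposition import tucker

tl.set_backend("pytorch")


def tucker_core(feat, ranks):
    """Tucker-decompose a single (C, H, W) feature map and return its core tensor."""
    core, _factors = tucker(feat, rank=list(ranks))
    return core


class Adaptor(nn.Module):
    """1x1-conv regressor that maps a student core tensor onto the teacher core's
    channel count, then resizes it to the teacher core's spatial shape."""

    def __init__(self, student_rank_c, teacher_rank_c):
        super().__init__()
        self.regress = nn.Conv2d(student_rank_c, teacher_rank_c, kernel_size=1)

    def forward(self, core_s, teacher_spatial):
        x = self.regress(core_s.unsqueeze(0))            # (1, C_t, H_s, W_s)
        x = F.interpolate(x, size=teacher_spatial,
                          mode="bilinear", align_corners=False)
        return x.squeeze(0)                              # (C_t, H_t, W_t)


def hybrid_loss(student_logits, labels, student_feats, teacher_feats, adaptor,
                ranks_teacher=(64, 8, 8), ranks_student=(32, 8, 8), alpha=0.5):
    """Cross-entropy on the labels plus an MSE term between Tucker core tensors."""
    distill = 0.0
    for fs, ft in zip(student_feats, teacher_feats):      # iterate over the batch
        core_t = tucker_core(ft.detach(), ranks_teacher)  # teacher core, no gradient
        core_s = tucker_core(fs, ranks_student)           # student core
        core_s = adaptor(core_s, core_t.shape[-2:])       # match teacher core's shape
        distill = distill + F.mse_loss(core_s, core_t)
    ce = F.cross_entropy(student_logits, labels)
    return ce + alpha * distill / len(student_feats)


# Example shapes for one mid-level stage (hypothetical ResNet-50 teacher vs.
# ResNet-18 student feature maps):
teacher_feats = torch.randn(4, 1024, 14, 14)
student_feats = torch.randn(4, 256, 14, 14)
logits, labels = torch.randn(4, 10), torch.randint(0, 10, (4,))
adaptor = Adaptor(student_rank_c=32, teacher_rank_c=64)
loss = hybrid_loss(logits, labels, student_feats, teacher_feats, adaptor)
```

Detaching the teacher's features keeps gradients on the student side only; in a real training setup the decomposition ranks would presumably be tuned per layer rather than fixed as here.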
ISSN: 0952-1976, 1873-6769
DOI: 10.1016/j.engappai.2022.105560