Knowledge distillation via Noisy Feature Reconstruction

Bibliographic details
Published in: Expert systems with applications 2024-12, Vol. 257, p. 124837, Article 124837
Main authors: Shi, Chaokun; Hao, Yuexing; Li, Gongyan; Xu, Shaoyun
Format: Article
Language: English
Description
Abstract: As a promising model compression technique, knowledge distillation supervises the training of small networks with knowledge from large networks in order to improve the small networks' performance. However, the improvement of student models is generally limited by the capacity gap between the teacher and student models. Previous approaches typically have students mimic teacher features directly, which may lead to suboptimal results. In this paper, we investigate the effect of introducing random noise into intermediate features on knowledge distillation. The noise helps students focus on learning key information and thus improves feature representation. Specifically, we propose Noisy Feature Distillation (NFD), which adds noise to student features and uses a convolutional block to reconstruct the features under the guidance of teacher features. In this way, we create a noisy environment that allows the student and teacher networks to focus on objects of interest in different tasks, thereby improving the robustness of the knowledge and its transfer process. In addition, we use the spatial attention of teacher features to modulate the noise distribution and guide students to focus on key pixels, further improving student performance. Extensive experiments show that our strategy outperforms other methods in various tasks such as classification, object detection, instance segmentation, and semantic segmentation. Significant performance gains are achieved in both homogeneous and heterogeneous distillation, and our strategy also excels in small object detection.
Highlights:
• We explore the impact of random noise on the performance of distillation.
• Appropriate noise helps the student features gain stronger representation power.
• We propose a knowledge distillation method based on noisy feature reconstruction.
• Our method is applicable to various tasks, e.g., classification and dense prediction.
• Our method yields excellent distillation results, especially in heterogeneous cases.
ISSN: 0957-4174
DOI: 10.1016/j.eswa.2024.124837
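
Illustrative sketch: the abstract describes NFD as three steps: perturb the student features with noise, shape that noise with the teacher's spatial attention, and reconstruct the result with a convolutional block supervised by the teacher features. The record does not give the paper's actual formulas, so everything below is an assumption made for illustration only: Gaussian noise, a softmax over channel-averaged absolute activations as the attention map, a two-layer 1x1 convolution as the reconstruction block, a mean-squared-error loss, and all function and parameter names (`spatial_attention`, `add_modulated_noise`, `reconstruct`, `nfd_loss`, `sigma`, `temperature`). It uses NumPy only and is not the authors' implementation.

```python
import numpy as np


def spatial_attention(teacher_feat, temperature=0.5):
    """Assumed attention: softmax over pixels of the channel-averaged |activation|.

    teacher_feat has shape (C_t, H, W). The returned (H, W) map is rescaled
    so that its mean is 1, which leaves the average noise level unchanged.
    """
    _, h, w = teacher_feat.shape
    energy = np.abs(teacher_feat).mean(axis=0).reshape(-1) / temperature
    energy = energy - energy.max()  # subtract the max for numerical stability
    weights = np.exp(energy)
    weights /= weights.sum()
    return (weights * h * w).reshape(h, w)


def add_modulated_noise(student_feat, attention, sigma, rng):
    """Add Gaussian noise to (C_s, H, W) features, scaled per pixel by attention."""
    noise = rng.normal(0.0, sigma, size=student_feat.shape)
    return student_feat + noise * attention[None, :, :]


def reconstruct(noisy_feat, w1, w2):
    """Stand-in convolutional block: 1x1 conv, ReLU, 1x1 conv.

    w1 has shape (C_hidden, C_s) and w2 has shape (C_t, C_hidden), so the
    block also maps student channels to teacher channels.
    """
    hidden = np.maximum(np.einsum("oc,chw->ohw", w1, noisy_feat), 0.0)
    return np.einsum("oc,chw->ohw", w2, hidden)


def nfd_loss(student_feat, teacher_feat, w1, w2, sigma=0.1, rng=None):
    """Mean-squared error between reconstructed noisy student features and teacher features."""
    rng = np.random.default_rng() if rng is None else rng
    attention = spatial_attention(teacher_feat)
    noisy = add_modulated_noise(student_feat, attention, sigma, rng)
    rebuilt = reconstruct(noisy, w1, w2)
    return float(np.mean((rebuilt - teacher_feat) ** 2))


# Toy usage with random tensors: 8 student channels, 16 teacher channels, 4x4 maps.
rng = np.random.default_rng(0)
student = rng.normal(size=(8, 4, 4))
teacher = rng.normal(size=(16, 4, 4))
w1 = rng.normal(scale=0.1, size=(16, 8))
w2 = rng.normal(scale=0.1, size=(16, 16))
loss = nfd_loss(student, teacher, w1, w2, sigma=0.1, rng=rng)
```

In a real training loop the reconstruction weights and the student backbone would be optimized jointly against this loss alongside the task loss; the sketch only evaluates the forward computation, and it assumes student and teacher feature maps share the same spatial size.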