Student-friendly knowledge distillation

In knowledge distillation, the knowledge from the teacher model is often too complex for the student model to thoroughly process. However, good teachers in real life always simplify complex material before teaching it to students. Inspired by this fact, we propose student-friendly knowledge distilla...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Knowledge-based systems 2024-07, Vol.296, p.111915, Article 111915
Hauptverfasser: Yuan, Mengyang, Lang, Bo, Quan, Fengnan
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:In knowledge distillation, the knowledge from the teacher model is often too complex for the student model to thoroughly process. However, good teachers in real life always simplify complex material before teaching it to students. Inspired by this fact, we propose student-friendly knowledge distillation (SKD) to simplify teacher output into new knowledge representations, which makes the learning of the student model easier and more effective. SKD involves a softening processing and a learning simplifier. First, the softening processing uses the temperature hyperparameter to soften the output logits of the teacher model, which simplifies the output to some extent and makes it easier for the learning simplifier to process. The learning simplifier utilizes the attention mechanism to further simplify the knowledge of the teacher model and is jointly trained with the student model using the distillation loss, which means that the process of simplification is correlated with the training objective of the student model and ensures that the simplified new teacher knowledge representation is more suitable for the specific student model. Furthermore, since SKD does not change the form of the distillation loss, it can be easily combined with other distillation methods that are based on the logits or features of intermediate layers to enhance its effectiveness. Therefore, SKD has wide applicability. The experimental results on the CIFAR-100 and ImageNet datasets show that our method achieves state-of-the-art performance while maintaining high training efficiency. •Simplify the knowledge difficulty for the student model in knowledge distillation.•Use softening and attention mechanism to simplify the knowledge.•Achieve state-of-the-art performance while maintaining high training efficiency.•Easy to blend with other knowledge distillation models.
ISSN:0950-7051
1872-7409
DOI:10.1016/j.knosys.2024.111915