Teacher–student complementary sample contrastive distillation

Bibliographic details
Published in: Neural Networks, 2024-02, Vol. 170, pp. 176-189
Main authors: Bao, Zhiqiang; Huang, Zhenhua; Gou, Jianping; Du, Lan; Liu, Kang; Zhou, Jingtao; Chen, Yunwen
Format: Article
Language: English
Description
Summary: Knowledge distillation (KD) is a widely adopted model compression technique for improving the performance of compact student models by utilizing the “dark knowledge” of a large teacher model. However, previous studies have not adequately investigated the effectiveness of supervision from the teacher model, and overconfident predictions in the student model may degrade its performance. In this work, we propose a novel framework, Teacher–Student Complementary Sample Contrastive Distillation (TSCSCD), that alleviates these challenges. TSCSCD consists of three key components: Contrastive Sample Hardness (CSH), Supervision Signal Correction (SSC), and Student Self-Learning (SSL). Specifically, CSH evaluates the teacher’s supervision for each sample by comparing the predictions of two compact models, one distilled from the teacher and the other trained from scratch. SSC corrects weak supervision according to CSH, while SSL employs integrated learning among multiple classifiers to regularize overconfident predictions. Extensive experiments on four real-world datasets demonstrate that TSCSCD outperforms recent state-of-the-art knowledge distillation techniques.
• We present a supervision correction component for knowledge distillation (KD).
• We propose a self-learning component that constrains overconfident predictions.
• The effectiveness of the framework is evaluated on four real-world datasets.
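
For readers unfamiliar with the baseline that the summary builds on, the following is a minimal sketch of the classic soft-target distillation objective (hard-label cross-entropy mixed with a temperature-softened KL term). It is not the TSCSCD method: the record does not give formulas for CSH, SSC, or SSL, so none are reproduced here. The function names `softmax` and `kd_loss` and the defaults `temperature=4.0` and `alpha=0.5` are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """Classic KD loss for one sample (illustrative baseline, not TSCSCD).

    Mixes cross-entropy on the hard label with KL(teacher || student)
    computed on temperature-softened distributions, scaled by T^2.
    """
    # Hard-label term: cross-entropy at temperature 1.
    p_student = softmax(student_logits)
    ce = -math.log(p_student[label])

    # Soft-label term: the teacher's softened distribution is the target.
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(q_teacher, q_student) if t > 0)

    return (1 - alpha) * ce + alpha * temperature ** 2 * kl
```

When the student's logits equal the teacher's, the KL term vanishes and only the hard-label term remains; the summary's SSC and SSL components can be read as adjustments to how much each of these two signals is trusted per sample.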
ISSN: 0893-6080, 1879-2782
DOI: 10.1016/j.neunet.2023.11.036