StableKD: Breaking Inter-block Optimization Entanglement for Stable Knowledge Distillation
Format: Article
Language: English
Abstract: Knowledge distillation (KD) has been recognized as an effective tool to
compress and accelerate models. However, current KD approaches generally suffer
from an accuracy drop and/or an excruciatingly long distillation process. In
this paper, we tackle the issue by first providing a new insight into a
phenomenon that we call the Inter-Block Optimization Entanglement (IBOE), which
makes the conventional end-to-end KD approaches unstable with noisy gradients.
We then propose StableKD, a novel KD framework that breaks the IBOE and
achieves more stable optimization. StableKD distinguishes itself through two
operations: Decomposition and Recomposition, where the former divides a pair of
teacher and student networks into several blocks for separate distillation, and
the latter progressively merges them back, evolving towards end-to-end
distillation. We conduct extensive experiments on CIFAR100, Imagewoof, and
ImageNet datasets with various teacher-student pairs. Compared to other KD
approaches, our simple yet effective StableKD greatly boosts the model accuracy
by 1%~18%, speeds up convergence by up to 10 times, and outperforms them
with only 40% of the training data.
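
The abstract only outlines the two operations, so the following is a minimal, hypothetical PyTorch sketch of block-wise distillation with progressive merging. The split points, the MSE feature loss, and the merge schedule (`group_size`) are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of Decomposition / Recomposition as described in the abstract.
# Split points, loss choice, and merge schedule are assumptions for illustration.
import torch
import torch.nn as nn

def decompose(model: nn.Sequential, split_points):
    """Split a sequential model into blocks at the given layer indices (assumed split points)."""
    bounds = [0] + list(split_points) + [len(model)]
    return [nn.Sequential(*list(model)[a:b]) for a, b in zip(bounds[:-1], bounds[1:])]

def blockwise_distill(teacher_blocks, student_blocks, x, criterion=nn.MSELoss()):
    """Distill each student block separately against the matching teacher block.
    Every block receives the teacher's previous output as input, so gradients
    never propagate across block boundaries."""
    losses, t_in = [], x
    for t_blk, s_blk in zip(teacher_blocks, student_blocks):
        with torch.no_grad():
            t_out = t_blk(t_in)                # teacher feature for this block
        s_out = s_blk(t_in.detach())           # same input, independent gradient path
        losses.append(criterion(s_out, t_out))
        t_in = t_out                           # teacher output feeds the next block
    return sum(losses)

def recompose(blocks, group_size=2):
    """Merge adjacent blocks into larger ones, moving back toward end-to-end KD (assumed schedule)."""
    return [nn.Sequential(*[layer for blk in blocks[i:i + group_size] for layer in blk])
            for i in range(0, len(blocks), group_size)]

# Toy usage with sequential stand-ins for the teacher and student backbones.
teacher = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
student = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
t_blocks, s_blocks = decompose(teacher, [2]), decompose(student, [2])
loss = blockwise_distill(t_blocks, s_blocks, torch.randn(8, 32))
loss.backward()
# In later epochs, recompose both networks into fewer, larger blocks and repeat.
t_blocks, s_blocks = recompose(t_blocks), recompose(s_blocks)
```

Because each student block is trained only against its teacher counterpart, noisy gradients from distant layers cannot interfere with it; recomposition then gradually restores the end-to-end setting.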
DOI: 10.48550/arxiv.2312.13223