Distilling the Knowledge in Data Pruning
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: With the increasing size of datasets used for training neural networks, data pruning has become an attractive field of research. However, most current data pruning algorithms are limited in their ability to preserve accuracy compared to models trained on the full data, especially in high pruning regimes. In this paper, we explore the application of data pruning while incorporating knowledge distillation (KD) when training on a pruned subset. That is, rather than relying solely on ground-truth labels, we also use the soft predictions from a teacher network pre-trained on the complete data. By integrating KD into training, we demonstrate significant improvement across datasets, pruning methods, and all pruning fractions. We first establish a theoretical motivation for employing self-distillation to improve training on pruned data. Then, we empirically make a compelling and highly practical observation: using KD, simple random pruning is comparable to or superior to sophisticated pruning methods across all pruning regimes. On ImageNet, for example, we achieve superior accuracy despite training on a random subset of only 50% of the data. Additionally, we demonstrate a crucial connection between the pruning factor and the optimal knowledge distillation weight. This helps mitigate the impact of samples with noisy labels and low-quality images retained by typical pruning algorithms. Finally, we make an intriguing observation: when using lower pruning fractions, larger teachers lead to accuracy degradation, while, surprisingly, employing teachers with a smaller capacity than the student's may improve results. Our code will be made available.
DOI: 10.48550/arxiv.2403.07854
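
The abstract describes training a student on a randomly pruned subset using both ground-truth labels and the soft predictions of a teacher pre-trained on the full data. The sketch below illustrates that setup in PyTorch under common knowledge-distillation conventions (temperature-scaled KL term plus cross-entropy). The helper names (`random_prune`, `kd_loss`, `train_on_pruned_subset`), the SGD settings, and the temperature are illustrative assumptions, not the authors' implementation; the pruning-fraction-dependent choice of the distillation weight reported in the paper is left as a free parameter `alpha`.

```python
# Minimal sketch (not the authors' code) of KD-assisted training on a pruned subset.
import torch
import torch.nn.functional as F
from torch.utils.data import Subset, DataLoader


def random_prune(dataset, keep_fraction, seed=0):
    """Keep a uniformly random `keep_fraction` of the training set."""
    g = torch.Generator().manual_seed(seed)
    n_keep = int(len(dataset) * keep_fraction)
    idx = torch.randperm(len(dataset), generator=g)[:n_keep]
    return Subset(dataset, idx.tolist())


def kd_loss(student_logits, teacher_logits, targets, alpha, T=4.0):
    """Weighted sum of hard-label cross-entropy and soft-label distillation."""
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # usual temperature scaling so gradients stay comparable to CE
    return (1.0 - alpha) * ce + alpha * kl


def train_on_pruned_subset(student, teacher, dataset, keep_fraction, alpha,
                           epochs=1, batch_size=128, lr=0.1, device="cpu"):
    """Train `student` on a random fraction of `dataset`, distilling from `teacher`."""
    loader = DataLoader(random_prune(dataset, keep_fraction),
                        batch_size=batch_size, shuffle=True)
    opt = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    teacher.eval()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                t_logits = teacher(x)  # soft targets from the full-data teacher
            loss = kd_loss(student(x), t_logits, y, alpha)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```

With `keep_fraction=0.5`, this corresponds to the ImageNet setting mentioned in the abstract (training on a random 50% subset); the abstract indicates only that the best `alpha` grows in importance as the pruning gets more aggressive, so any concrete schedule would have to come from the paper itself.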