Solvable Model for Inheriting the Regularization through Knowledge Distillation
Main authors: | , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Proceedings of the 2nd Mathematical and Scientific Machine Learning Conference, PMLR 145:809-846, 2022. In recent years, the empirical success of transfer learning with neural networks has stimulated increasing interest in obtaining a theoretical understanding of its core properties. Knowledge distillation, where a smaller neural network is trained using the outputs of a larger neural network, is a particularly interesting case of transfer learning. In the present work, we introduce a statistical physics framework that allows an analytic characterization of the properties of knowledge distillation (KD) in shallow neural networks. Focusing the analysis on a solvable model that exhibits a non-trivial generalization gap, we investigate the effectiveness of KD. We show that, through KD, the regularization properties of the larger teacher model can be inherited by the smaller student, and that the resulting generalization performance is closely linked to, and limited by, the optimality of the teacher. Finally, we analyze the double descent phenomenology that can arise in the considered KD setting. |
---|---|
DOI: | 10.48550/arxiv.2012.00194 |
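
The abstract above describes knowledge distillation as training a smaller student network on the outputs of a larger, regularized teacher. The sketch below is only a minimal illustration of that inheritance effect in a linear-regression surrogate, not the solvable shallow-network model analyzed in the paper; all model choices, sizes, and hyperparameters here are illustrative assumptions.

```python
# Illustrative knowledge-distillation sketch (linear-regression surrogate):
# a ridge-regularized "teacher" is fit on noisy labels, and a smaller "student"
# (restricted to a subset of features, fit without explicit regularization) is
# trained either on the raw labels or on the teacher's outputs.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data with label noise.
n_train, n_test, d = 60, 2000, 50
X_train = rng.standard_normal((n_train, d)) / np.sqrt(d)
X_test = rng.standard_normal((n_test, d)) / np.sqrt(d)
w_true = rng.standard_normal(d)
y_train = X_train @ w_true + 0.5 * rng.standard_normal(n_train)
y_test = X_test @ w_true  # noiseless targets for measuring generalization

# Teacher: ridge regression (explicit L2 regularization) on all features.
lam = 1.0
w_teacher = np.linalg.solve(
    X_train.T @ X_train + lam * np.eye(d), X_train.T @ y_train
)
teacher_outputs = X_train @ w_teacher  # distillation targets

# Student: a smaller model that only sees the first d_s features, fit by
# plain (unregularized) least squares.
d_s = 30
Xs_train, Xs_test = X_train[:, :d_s], X_test[:, :d_s]

w_student_labels = np.linalg.pinv(Xs_train) @ y_train        # trained on noisy labels
w_student_kd = np.linalg.pinv(Xs_train) @ teacher_outputs    # trained on teacher outputs

def test_mse(X, w):
    return float(np.mean((X @ w - y_test) ** 2))

print(f"teacher (ridge) test MSE:       {test_mse(X_test, w_teacher):.3f}")
print(f"student on raw labels test MSE: {test_mse(Xs_test, w_student_labels):.3f}")
print(f"student distilled test MSE:     {test_mse(Xs_test, w_student_kd):.3f}")
```

On a typical run, the student fit on the teacher's outputs tracks the teacher's test error more closely than the student fit on the raw labels, loosely mirroring the abstract's claims that the teacher's regularization can be inherited through the distillation targets and that the student's generalization is limited by the quality of the teacher.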