Multi-teacher distillation BERT model in NLU tasks


Bibliographic details
Published in: 大数据 2024-05, Vol.10, p.119-132
Authors: Jialai SHI, Weibin GUO
Format: Article
Language: Chinese
Online access: Full text
Description
Abstract: Knowledge distillation is a model compression technique commonly used to address the large scale and slow inference of deep pre-trained models such as BERT. "Multi-teacher distillation" can further improve the performance of the student model, but the traditional "one-to-one" mapping mandatorily assigns each student layer to a single intermediate layer of the teacher model, so most of the intermediate features are discarded. A "one-to-many" mapping method is proposed to solve the problem that intermediate layers cannot be aligned during knowledge distillation, and to help the student model acquire the syntactic, coreference, and other knowledge encoded in the teacher's intermediate layers. Experiments on several datasets in GLUE show that the student model retains 93.9% of the teacher model's average inference accuracy while having only 41.5% of its average parameter size.
ISSN:2096-0271
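
As a rough illustration of the "one-to-many" intermediate-layer mapping described in the abstract, the following PyTorch sketch lets each student layer learn from a group of teacher layers instead of one mandatorily assigned layer. The uniform grouping, the MSE objective, the function name one_to_many_hidden_loss, and all tensor shapes are assumptions made for illustration; they are not taken from the paper.

import torch
import torch.nn.functional as F


def one_to_many_hidden_loss(student_hiddens, teacher_hiddens):
    """Distillation loss where each student layer learns from a group of
    teacher intermediate layers rather than a single assigned one.

    student_hiddens: list of [batch, seq_len, dim] tensors, one per student layer
    teacher_hiddens: list of [batch, seq_len, dim] tensors, one per teacher layer
    """
    n_s, n_t = len(student_hiddens), len(teacher_hiddens)
    group = n_t // n_s  # assumed uniform grouping of teacher layers
    loss = 0.0
    for i, s_h in enumerate(student_hiddens):
        # every teacher layer in the i-th group contributes, so no
        # intermediate features are discarded; if the hidden sizes of
        # teacher and student differ, a linear projection would be needed
        for t_h in teacher_hiddens[i * group:(i + 1) * group]:
            loss = loss + F.mse_loss(s_h, t_h)
    return loss / (n_s * group)


if __name__ == "__main__":
    # toy shapes: a 12-layer teacher distilled into a 4-layer student
    teacher = [torch.randn(2, 16, 768) for _ in range(12)]
    student = [torch.randn(2, 16, 768, requires_grad=True) for _ in range(4)]
    print(one_to_many_hidden_loss(student, teacher))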