Towards Zero-Shot Knowledge Distillation for Natural Language Processing
Saved in:
Main authors: , , ,
Format: Article
Language: eng
Subjects:
Online access: Order full text
Abstract: Knowledge Distillation (KD) is a common knowledge transfer algorithm used for model compression across a variety of deep-learning-based natural language processing (NLP) solutions. In its regular manifestations, KD requires access to the teacher's training data for knowledge transfer to the student network. However, privacy concerns, data regulations, and proprietary reasons may prevent access to such data. We present, to the best of our knowledge, the first work on Zero-Shot Knowledge Distillation for NLP, where the student learns from the much larger teacher without any task-specific data. Our solution combines out-of-domain data and adversarial training to learn the teacher's output distribution. We investigate six tasks from the GLUE benchmark and demonstrate that we can achieve between 75% and 92% of the teacher's classification score (accuracy or F1) while compressing the model 30 times.
DOI: 10.48550/arxiv.2012.15495
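The abstract only states that the method combines out-of-domain data and adversarial training so that the student matches the teacher's output distribution without task-specific data. The PyTorch sketch below illustrates what such a distillation objective can look like under those assumptions; the temperature-scaled `kd_loss`, the FGSM-style `adversarial_embeddings` helper, the toy linear models, and all hyperparameters are illustrative choices of this note, not details from the paper.

```python
# Minimal sketch of zero-shot-style knowledge distillation: the student matches
# the teacher's softened output distribution on out-of-domain inputs, with an
# adversarial perturbation step that seeks inputs where the two models disagree.
# Models, dimensions, and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

def adversarial_embeddings(embeddings, student, teacher, epsilon=0.01):
    """FGSM-style step pushing embeddings toward larger student-teacher disagreement."""
    emb = embeddings.clone().detach().requires_grad_(True)
    disagreement = kd_loss(student(emb), teacher(emb))
    disagreement.backward()
    return (emb + epsilon * emb.grad.sign()).detach()

# Toy stand-ins for a large teacher and a much smaller (compressed) student
# classifier operating on pooled sentence embeddings.
teacher = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 2))
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(100):
    # Random vectors stand in for embeddings of out-of-domain text the
    # teacher was never trained on (no task-specific data is used).
    ood_batch = torch.randn(32, 128)
    adv_batch = adversarial_embeddings(ood_batch, student, teacher)
    with torch.no_grad():
        teacher_logits = teacher(adv_batch)
    loss = kd_loss(student(adv_batch), teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In this reading, the adversarial step supplies the training signal that real task data would otherwise provide: it searches for inputs on which the student's distribution still diverges from the teacher's, and the distillation loss then closes that gap.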