Sparse Distillation: Speeding Up Text Classification by Using Bigger Student Models
Saved in:
Main authors: , , , , ,
Format: Article
Language: eng
Keywords:
Online access: Order full text
Summary: Distilling state-of-the-art transformer models into lightweight student models is an effective way to reduce computation cost at inference time. The student models are typically compact transformers with fewer parameters, while expensive operations such as self-attention persist. Therefore, the improved inference speed may still be unsatisfactory for real-time or high-volume use cases. In this paper, we aim to further push the limit of inference speed by distilling teacher models into bigger, sparser student models -- bigger in that they scale up to billions of parameters; sparser in that most of the model parameters are n-gram embeddings. Our experiments on six single-sentence text classification tasks show that these student models retain 97% of the RoBERTa-Large teacher performance on average, and meanwhile achieve up to 600x speed-up on both GPUs and CPUs at inference time. Further investigation reveals that our pipeline is also helpful for sentence-pair classification tasks, and in domain generalization settings.
DOI: 10.48550/arxiv.2110.08536
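
The summary describes student models that keep almost all of their capacity in a large table of n-gram embeddings rather than in self-attention layers. The sketch below is a minimal illustration of that idea, assuming a bag-of-n-grams student (an embedding table over hashed n-gram ids, mean-pooled and fed to a linear classifier) trained against teacher soft labels with the standard temperature-scaled KL-divergence distillation loss. The class name NGramStudent, the vocabulary size, embedding dimension, and temperature are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NGramStudent(nn.Module):
    """Bag-of-n-grams student: nearly all parameters sit in the embedding table."""

    def __init__(self, vocab_size: int, embed_dim: int, num_classes: int):
        super().__init__()
        # sparse=True: only the embedding rows seen in a batch receive gradients.
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean", sparse=True)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, ngram_ids: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
        # ngram_ids: flat 1-D tensor of hashed n-gram ids for the whole batch.
        # offsets: start position of each example inside ngram_ids.
        pooled = self.embedding(ngram_ids, offsets)  # mean of each example's n-gram vectors
        return self.classifier(pooled)               # class logits


def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Temperature-scaled KL divergence between student and teacher distributions."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)


# Toy batch: example 1 has hashed n-gram ids [3, 17, 42], example 2 has [7, 99].
model = NGramStudent(vocab_size=1_000_000, embed_dim=256, num_classes=2)
ngram_ids = torch.tensor([3, 17, 42, 7, 99])
offsets = torch.tensor([0, 3])
teacher_logits = torch.tensor([[2.0, -1.0], [-0.5, 1.5]])  # e.g. from a RoBERTa-Large teacher
loss = distillation_loss(model(ngram_ids, offsets), teacher_logits)
loss.backward()
```

Because the embedding bag is marked sparse, only the rows of the table touched by a batch get gradient updates, and inference is a lookup plus a mean and one linear layer, which is why such a student can grow to billions of parameters while remaining far cheaper to run than a compact transformer.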