Knowledge Distillation of Russian Language Models with Reduction of Vocabulary
Saved in:
Main authors: | , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Abstract: | Today, transformer language models serve as a core component for the majority of
natural language processing tasks. Industrial applications of such models require
minimizing computation time and memory footprint. Knowledge distillation is one
approach to address this goal. Existing methods in this field mainly focus on reducing
the number of layers or the dimension of embeddings/hidden representations. An
alternative option is to reduce the number of tokens in the vocabulary and therefore
the embedding matrix of the student model. The main problem with vocabulary
minimization is the mismatch between the input sequences and the output class
distributions of the teacher and student models. As a result, KL-based knowledge
distillation cannot be applied directly. We propose two simple yet effective alignment
techniques that make knowledge distillation possible for students with a reduced
vocabulary. Evaluation of the distilled models on a number of common benchmarks for
Russian, such as Russian SuperGLUE, SberQuAD, RuSentiment, ParaPhraser, and
Collection-3, demonstrated that our techniques achieve compression from $17\times$ to
$49\times$ while maintaining the quality of a $1.7\times$-compressed student that keeps
the full-sized vocabulary and only reduces the number of Transformer layers. We make
our code and distilled models available. (An illustrative sketch of the
vocabulary-alignment problem is given after this record.) |
DOI: | 10.48550/arxiv.2205.02340 |
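
The abstract does not spell out the two proposed alignment techniques, so the sketch below only illustrates the underlying problem: a standard KL-based distillation loss assumes the teacher and student share one vocabulary, and one simple, hypothetical workaround is to restrict the teacher's logits to the tokens that survive in the reduced student vocabulary and renormalize over that subset. All names (`kl_distillation_loss`, `align_teacher_to_student`, `shared_token_ids`), the vocabulary sizes, and the PyTorch framing are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def kl_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # Standard KL-based distillation: soften both distributions with a temperature
    # and push the student's log-probabilities toward the teacher's probabilities.
    # This only works when both logit tensors cover the same set of classes (tokens).
    t = F.softmax(teacher_logits / temperature, dim=-1)
    s = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

def align_teacher_to_student(teacher_logits, shared_token_ids):
    # Hypothetical alignment (not necessarily the paper's technique): keep only the
    # teacher logits for tokens that exist in the reduced student vocabulary; the
    # softmax above then renormalizes over this shared subset.
    # shared_token_ids[i] = teacher-vocabulary id of student token i.
    return teacher_logits[..., shared_token_ids]

# Toy usage with made-up vocabulary sizes: a 30k-token teacher, a 5k-token student.
teacher_logits = torch.randn(8, 30_000)             # (batch, |V_teacher|)
student_logits = torch.randn(8, 5_000)              # (batch, |V_student|)
shared_token_ids = torch.randperm(30_000)[:5_000]   # student id -> teacher id
loss = kl_distillation_loss(
    align_teacher_to_student(teacher_logits, shared_token_ids),
    student_logits,
)
```

In this toy version, teacher tokens with no counterpart in the student vocabulary are simply dropped; the paper's actual alignment techniques, and the handling of differently tokenized input sequences, are described in the full text.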