Kronecker Decomposition for GPT Compression
Abstract: GPT is an auto-regressive Transformer-based pre-trained language model that has attracted a lot of attention in the natural language processing (NLP) domain due to its state-of-the-art performance on several downstream tasks. The success of GPT is mostly attributed to its pre-training on huge amounts of data and its large number of parameters (from ~100M to billions of parameters). Despite the superior performance of GPT (especially in few-shot or zero-shot setups), its overparameterized nature can make it prohibitive to deploy on devices with limited computational power or memory. This problem can be mitigated with model compression techniques; however, compressing GPT models has not been investigated much in the literature. In this work, we use Kronecker decomposition to compress the linear mappings of the GPT-2 model. Our Kronecker GPT-2 model (KnGPT2) is initialized from the Kronecker-decomposed version of the GPT-2 model and then undergoes very light pre-training on only a small portion of the training data with intermediate-layer knowledge distillation (ILKD). Finally, our KnGPT2 is fine-tuned on downstream tasks using ILKD as well. We evaluate our model on both language modeling and the General Language Understanding Evaluation (GLUE) benchmark tasks and show that, with more efficient pre-training and a similar number of parameters, our KnGPT2 significantly outperforms the existing DistilGPT2 model.
DOI: 10.48550/arxiv.2110.08152
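
The core idea, replacing the dense weight matrix W of a linear mapping with a Kronecker product A ⊗ B of two much smaller factors and initializing those factors from the pretrained weights, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the KroneckerLinear class, the factor shapes, and the SVD-based nearest-Kronecker-product initialization (Van Loan and Pitsianis) are assumptions chosen for illustration; the paper's exact factorization, ILKD objective, and training recipe are described in the full text.

```python
import torch
import torch.nn as nn


class KroneckerLinear(nn.Module):
    """Linear layer whose (out x in) weight is the Kronecker product A kron B,
    with A of shape (m1, n1) and B of shape (m2, n2), m1*m2 = out, n1*n2 = in."""

    def __init__(self, in_features, out_features, n1, m1, bias=True):
        super().__init__()
        assert in_features % n1 == 0 and out_features % m1 == 0
        self.n1, self.n2 = n1, in_features // n1
        self.m1, self.m2 = m1, out_features // m1
        self.A = nn.Parameter(torch.randn(self.m1, self.n1) * 0.02)
        self.B = nn.Parameter(torch.randn(self.m2, self.n2) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x):
        # (A kron B) x = reshape(A @ X @ B^T) with X = x reshaped to (n1, n2),
        # so the full (out x in) weight matrix is never materialized.
        batch_shape = x.shape[:-1]
        X = x.reshape(-1, self.n1, self.n2)
        Y = self.A @ X @ self.B.transpose(0, 1)        # (batch, m1, m2)
        y = Y.reshape(*batch_shape, self.m1 * self.m2)
        return y if self.bias is None else y + self.bias

    @torch.no_grad()
    def init_from_dense(self, W):
        # Nearest Kronecker product of a pretrained dense W (out x in) in the
        # Frobenius norm: rank-1 SVD of a rearranged matrix R whose rank-1
        # approximation corresponds to the best A kron B approximation of W.
        R = (W.reshape(self.m1, self.m2, self.n1, self.n2)
              .permute(0, 2, 1, 3)
              .reshape(self.m1 * self.n1, self.m2 * self.n2))
        U, S, Vh = torch.linalg.svd(R, full_matrices=False)
        s = S[0].sqrt()
        self.A.copy_((s * U[:, 0]).reshape(self.m1, self.n1))
        self.B.copy_((s * Vh[0]).reshape(self.m2, self.n2))


# Example with hypothetical shapes: compress a 3072 x 768 GPT-2 MLP projection
# to two factors totaling 96*32 + 32*24 = 3840 parameters (vs. ~2.4M dense).
# layer = KroneckerLinear(in_features=768, out_features=3072, n1=32, m1=96)
# layer.init_from_dense(pretrained_weight)   # pretrained_weight: (3072, 768) tensor
```

After such an initialization, the compressed layers only approximate the original mappings, which is why the paper follows it with light ILKD-guided pre-training on a small portion of the data before fine-tuning.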