Self-Knowledge Distillation in Natural Language Processing
Main Authors:
Format: Article
Language: English
Subjects:
Online Access: Order full text
Abstract: Since deep learning became a key player in natural language processing (NLP), many deep learning models have shown remarkable performance on a variety of NLP tasks, in some cases even outperforming humans. Such high performance can be explained by the efficient knowledge representation of deep learning models. While many methods have been proposed to learn more efficient representations, knowledge distillation from pretrained deep networks suggests that we can use more information from the soft target probabilities to train other neural networks. In this paper, we propose a new knowledge distillation method, self-knowledge distillation, based on the soft target probabilities of the training model itself, where multimode information is distilled from the word embedding space right below the softmax layer. To keep the time complexity manageable, our method approximates the soft target probabilities. In experiments, we applied the proposed method to two different and fundamental NLP tasks: language modeling and neural machine translation. The experimental results show that our proposed method improves performance on both tasks.
DOI: 10.48550/arxiv.1908.01851
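The abstract above describes distilling soft target probabilities that come from the training model's own word-embedding space rather than from a separate teacher. Below is a minimal PyTorch sketch of that general idea, under simplifying assumptions: the soft targets are built from cosine similarities between the ground-truth token's embedding and all vocabulary embeddings, then mixed with the usual cross-entropy loss by a weight `alpha`. The function names, the temperature, and `alpha` are illustrative, and this is not the paper's exact formulation (which additionally approximates the soft target probabilities for efficiency).

```python
import torch
import torch.nn.functional as F

def embedding_soft_targets(embedding, target_ids, temperature=1.0):
    """Soft target distribution over the vocabulary, derived from cosine
    similarity between the gold token's embedding and every embedding.
    embedding: (vocab_size, dim), target_ids: (batch,)."""
    target_vecs = embedding[target_ids]                                # (batch, dim)
    sims = F.normalize(target_vecs, dim=-1) @ F.normalize(embedding, dim=-1).T
    return F.softmax(sims / temperature, dim=-1)                       # (batch, vocab)

def self_distillation_loss(logits, target_ids, embedding, alpha=0.5, temperature=1.0):
    """Mix hard cross-entropy with a KL term against the embedding-derived
    soft targets (detached so no gradient flows through their construction)."""
    hard_loss = F.cross_entropy(logits, target_ids)
    with torch.no_grad():
        soft_targets = embedding_soft_targets(embedding, target_ids, temperature)
    log_probs = F.log_softmax(logits / temperature, dim=-1)
    soft_loss = F.kl_div(log_probs, soft_targets, reduction="batchmean")
    return (1.0 - alpha) * hard_loss + alpha * soft_loss
```

For example, with `logits` of shape `(batch, vocab_size)`, integer `target_ids` of shape `(batch,)`, and the output embedding weight matrix, `self_distillation_loss(logits, target_ids, embedding)` returns a scalar that can be backpropagated in place of plain cross-entropy; `alpha` controls how strongly the embedding-based soft targets influence training.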