Long Short-Term Sample Distillation
Main authors: | , , , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Abstract: | In the past decade, there has been substantial progress in training
increasingly deep neural networks. Recent advances within the teacher--student
training paradigm have established that information about past training updates
shows promise as a source of guidance during subsequent training steps. Based on
this notion, in this paper, we propose Long Short-Term Sample Distillation, a
novel training policy that simultaneously leverages multiple phases of the
previous training process to guide the later training updates to a neural
network, while efficiently proceeding in just a single generation pass. With
Long Short-Term Sample Distillation, the supervision signal for each sample is
decomposed into two parts: a long-term signal and a short-term one. The
long-term teacher draws on snapshots from several epochs ago in order to
provide steadfast guidance and to guarantee teacher--student differences, while
the short-term one yields more up-to-date cues with the goal of enabling
higher-quality updates. Moreover, the teachers for each sample are unique, such
that, overall, the model learns from a very diverse set of teachers.
Comprehensive experimental results across a range of vision and NLP tasks
demonstrate the effectiveness of this new training method. |
---|---|
DOI: | 10.48550/arxiv.2003.00739 |
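
The abstract describes a supervision signal that mixes the ground-truth label with soft targets from a long-term teacher (a snapshot taken several epochs ago) and a short-term teacher (a more recent snapshot), with teachers cached per sample. The sketch below illustrates that idea in PyTorch under stated assumptions: the function name `lstsd_loss`, the KL-based distillation terms, and the mixing weights `w_long`/`w_short` are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def lstsd_loss(logits, targets, long_term_probs, short_term_probs,
               w_long=0.3, w_short=0.3):
    """Mix the ground-truth signal with long- and short-term teacher signals.

    logits           -- current model outputs, shape (batch, num_classes)
    targets          -- ground-truth class indices, shape (batch,)
    long_term_probs  -- per-sample soft targets cached from a snapshot taken
                        several epochs ago (the "long-term teacher")
    short_term_probs -- per-sample soft targets cached from a recent snapshot
                        (the "short-term teacher")
    w_long, w_short  -- illustrative mixing weights (assumed; the paper's
                        weighting scheme may differ)
    """
    log_p = F.log_softmax(logits, dim=1)
    ce = F.nll_loss(log_p, targets)                                      # supervised term
    kl_long = F.kl_div(log_p, long_term_probs, reduction="batchmean")    # long-term teacher
    kl_short = F.kl_div(log_p, short_term_probs, reduction="batchmean")  # short-term teacher
    return (1.0 - w_long - w_short) * ce + w_long * kl_long + w_short * kl_short


# Shape-only usage example with random tensors (not meaningful data).
# In practice, the soft targets would be the model's own softmax outputs
# recorded per sample index at earlier checkpoints, so different samples
# naturally end up with different teachers.
logits = torch.randn(8, 10, requires_grad=True)
targets = torch.randint(0, 10, (8,))
long_t = F.softmax(torch.randn(8, 10), dim=1)
short_t = F.softmax(torch.randn(8, 10), dim=1)
loss = lstsd_loss(logits, targets, long_t, short_t)
loss.backward()
```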