Massively Distributed SGD: ImageNet/ResNet-50 Training in a Flash
Saved in:
Format: Article
Language: English
Online Access: Order full text
Abstract: Scaling distributed deep learning to a massive GPU cluster is challenging due to the instability of large mini-batch training and the overhead of gradient synchronization. We address the instability of large mini-batch training with batch-size control and label smoothing. We address the overhead of gradient synchronization with 2D-Torus all-reduce. Specifically, 2D-Torus all-reduce arranges GPUs in a logical 2D grid and performs a series of collective operations in different orientations. These two techniques are implemented with Neural Network Libraries (NNL). We have successfully trained ImageNet/ResNet-50 in 122 seconds without significant accuracy loss on the ABCI cluster.
DOI: 10.48550/arxiv.1811.05233
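The abstract describes 2D-Torus all-reduce as arranging GPUs in a logical 2D grid and running collective operations along different orientations. The sketch below simulates that data flow on host arrays with NumPy, purely for illustration: the three-phase split (row-wise reduce-scatter, column-wise all-reduce, row-wise all-gather), the grid shape, and all function names are assumptions made here, not the authors' NNL/NCCL implementation.

```python
import numpy as np


def two_d_torus_allreduce(grads, rows, cols):
    """Simulate a 2D-Torus all-reduce over rows * cols 'GPUs' (hypothetical sketch).

    grads: list of 1-D gradient arrays, one per simulated GPU, all the same length.
    Returns the cluster-wide gradient sum replicated on every simulated GPU.
    """
    assert len(grads) == rows * cols
    n = grads[0].size
    assert n % cols == 0, "assume the gradient splits evenly across a row for simplicity"
    chunk = n // cols

    # Arrange the per-GPU gradients as a rows x cols logical grid.
    g = np.stack(grads).astype(np.float64).reshape(rows, cols, n)

    # Phase 1 (horizontal): reduce-scatter within each row, so GPU (r, c)
    # ends up holding only chunk c of the row-wide gradient sum.
    row_sum = g.sum(axis=1)                          # (rows, n)
    partial = np.empty((rows, cols, chunk))
    for c in range(cols):
        partial[:, c, :] = row_sum[:, c * chunk:(c + 1) * chunk]

    # Phase 2 (vertical): all-reduce within each column over the scattered
    # chunks, so every GPU in a column holds the cluster-wide sum of its chunk.
    col_sum = partial.sum(axis=0)                    # (cols, chunk)

    # Phase 3 (horizontal): all-gather within each row, so every GPU ends up
    # with the full cluster-wide gradient sum (averaging for SGD would divide
    # by rows * cols afterwards).
    full = np.concatenate([col_sum[c] for c in range(cols)])
    return [full.copy() for _ in range(rows * cols)]


# Usage: a 2 x 3 grid of simulated GPUs, each with a random gradient vector.
rng = np.random.default_rng(0)
grads = [rng.standard_normal(12) for _ in range(6)]
reduced = two_d_torus_allreduce(grads, rows=2, cols=3)
assert np.allclose(reduced[0], np.sum(grads, axis=0))
```

The usual motivation for such a 2D layout is that each collective runs over a ring of only one grid dimension rather than over all GPUs at once, which keeps the per-ring communication steps short as the cluster grows.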