Mini-batch Serialization: CNN Training with Inter-layer Data Reuse
Saved in:
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Summary: Training convolutional neural networks (CNNs) requires intense computations
and high memory bandwidth. We find that bandwidth today is over-provisioned
because most memory accesses in CNN training can be eliminated by rearranging
computation to better utilize on-chip buffers and avoid traffic resulting from
large per-layer memory footprints. We introduce the MBS CNN training approach
that significantly reduces memory traffic by partially serializing mini-batch
processing across groups of layers. This optimizes reuse within on-chip buffers
and balances both intra-layer and inter-layer reuse. We also introduce the
WaveCore CNN training accelerator that effectively trains CNNs in the MBS
approach with high functional-unit utilization. Combined, WaveCore and MBS
reduce DRAM traffic by 75%, improve performance by 53%, and save 26% system
energy for modern deep CNN training compared to conventional training
mechanisms and accelerators.
DOI: 10.48550/arxiv.1810.00307
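
The abstract describes MBS as partially serializing mini-batch processing across groups of layers so that each group's weights are reused while only a small slice of activations stays resident. The sketch below is a rough conceptual illustration of that idea under stated assumptions, not the paper's implementation or the WaveCore design: fully connected layers stand in for convolutions, and the layer shapes, group boundaries, and sub-batch size are illustrative choices.

```python
# Conceptual sketch of mini-batch serialization (illustrative assumptions only).
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward_group(x, weights):
    """Run one sub-batch through every layer of a layer group."""
    for w in weights:
        x = relu(x @ w)  # stand-in for a conv layer
    return x

def mbs_forward(batch, layer_groups, sub_batch):
    """Mini-batch-serialized forward pass.

    Instead of pushing the full mini-batch through one layer at a time
    (large per-layer activation footprint), each group of layers processes
    the mini-batch in small sub-batches, reusing the group's weights while
    keeping only a sub-batch of activations live at once.
    """
    x = batch
    for weights in layer_groups:
        outs = []
        for start in range(0, x.shape[0], sub_batch):
            outs.append(forward_group(x[start:start + sub_batch], weights))
        x = np.concatenate(outs, axis=0)  # hand the group's output to the next group
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch = rng.standard_normal((32, 64))                   # mini-batch of 32 inputs
    groups = [
        [rng.standard_normal((64, 64)) for _ in range(3)],  # group 1: three layers
        [rng.standard_normal((64, 10))],                    # group 2: one layer
    ]
    out = mbs_forward(batch, groups, sub_batch=8)           # serialize in sub-batches of 8
    print(out.shape)  # (32, 10)
```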