DQ-STP: An Efficient Sparse On-Device Training Processor Based on Low-Rank Decomposition and Quantization for DNN
Published in: | IEEE Transactions on Circuits and Systems I: Regular Papers, 2024-04, Vol. 71 (4), p. 1665-1678 |
---|---|
Main authors: | , , , , , , |
Format: | Article |
Language: | English |
Subjects: | |
Abstract: | Due to bottlenecks such as scenario-varying applications, significant data-communication overhead, and privacy concerns between off-line training and on-line inference, intelligent edge devices capable of adaptively fine-tuning deep neural network (DNN) models for specific tasks have become an urgent need. However, the computational cost of ordinary on-device training (ODT) is intolerable, which motivates us to explore an efficient ODT processor, named DQ-STP. In this paper, we leverage a series of optimization techniques through software-hardware co-design. On the software side, the proposed design incorporates SVD-based low-rank decomposition, 2^n quantization, and the ACBN algorithm, which unifies the sparse computing mode of the convolutional layers and enhances weight sparsity. On the hardware side, the proposed design effectively exploits data sparsity through four techniques: 1) a flag compressed sparse row format is proposed to compress input feature maps and gradient maps; 2) a unified processing element (PE) array comprising shifters and adders is proposed to accelerate the forward- and error-propagation steps; 3) the PE arrays for error propagation and weight-gradient generation are separated to enhance throughput; and 4) a sparse alignment strategy is proposed to further improve PE utilization. Through this software-hardware co-optimization, the proposed DQ-STP achieves an area efficiency of 41.2 GOPS/mm² and a peak energy efficiency of 90.63 TOPS/W. In comparison to state-of-the-art reference designs, the proposed DQ-STP demonstrates a 2.19× improvement in normalized area efficiency and a 1.85× improvement in energy efficiency. |
---|---|
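The abstract's SVD-based low-rank decomposition can be illustrated with a minimal sketch: a weight matrix W is approximated by two thin factors obtained from a truncated singular value decomposition, reducing both parameters and multiply-accumulates. The rank choice and the layer mapping below are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def low_rank_decompose(W, rank):
    """Approximate W (m x n) as A @ B with A (m x rank), B (rank x n)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
A, B = low_rank_decompose(W, rank=8)
W_approx = A @ B
# Parameter count drops from 64*64 = 4096 to 2*64*8 = 1024.
```

With rank r much smaller than min(m, n), a dense layer's cost shrinks from m·n to r·(m + n), at the price of an approximation error controlled by the discarded singular values.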
ISSN: | 1549-8328 1558-0806 |
DOI: | 10.1109/TCSI.2024.3364093 |
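The 2^n quantization mentioned in the abstract restricts each weight to a signed power of two, so that a hardware multiply reduces to a bit shift, which is why the PE array can be built from shifters and adders. A hedged sketch follows; the exponent range is an assumption for illustration, not the paper's configuration.

```python
import numpy as np

def quantize_pow2(w, e_min=-8, e_max=0):
    """Round each weight to sign(w) * 2^e with e in [e_min, e_max].

    A multiply by such a weight becomes a shift by |e| bits in hardware.
    """
    sign = np.sign(w)
    mag = np.abs(w)
    # Round the magnitude's exponent to the nearest integer, then clip.
    e = np.clip(np.round(np.log2(np.maximum(mag, 2.0**e_min))), e_min, e_max)
    q = sign * 2.0**e
    q[mag == 0] = 0.0  # keep exact zeros (preserves weight sparsity)
    return q, e.astype(int)

w = np.array([0.3, -0.07, 0.5, 0.0])
q, e = quantize_pow2(w)
# 0.3 -> 0.25 (shift by 2), -0.07 -> -0.0625 (shift by 4), 0.5 and 0.0 unchanged
```

Because every nonzero quantized weight is ±2^e, the accumulation x * q reduces to (±x) >> |e| in fixed-point arithmetic, matching the shifter-and-adder PE design the abstract describes.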