DQ-STP: An Efficient Sparse On-Device Training Processor Based on Low-Rank Decomposition and Quantization for DNN

Detailed Description

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems I: Regular Papers, 2024-04, Vol. 71 (4), p. 1665-1678
Authors: Li, Baoting; Zhang, Danqing; Zhao, Pengfei; Wang, Hang; Zhang, Xuchong; Sun, Hongbin; Zheng, Nanning
Format: Article
Language: English
Description

Abstract: Due to bottleneck problems such as scenario-varying applications, significant data communication overhead, and privacy concerns between off-line training and on-line inference, intelligent edge devices capable of adaptively fine-tuning deep neural network (DNN) models for specific tasks have become an urgent need. However, the computational cost of ordinary on-device training (ODT) is intolerable, which motivates an efficient ODT processor, named DQ-STP. In this paper, we leverage a series of optimization techniques through software-hardware co-design. On the software side, the proposed design incorporates SVD-based low-rank decomposition, 2^n quantization, and the ACBN algorithm, which unifies the sparse computing mode of convolutional layers and enhances weight sparsity. On the hardware side, the proposed design effectively exploits data sparsity through four techniques: 1) A flag compressed sparse row format is proposed to compress input feature maps and gradient maps. 2) A unified processing element (PE) array comprising shifters and adders is proposed to expedite the forward and error propagation steps. 3) The PE arrays for error propagation and weight gradient generation are separated to enhance throughput. 4) A sparse alignment strategy is proposed to further improve PE utilization. Through this software-hardware co-optimization, the proposed DQ-STP achieves an area efficiency of 41.2 GOPS/mm² and a peak energy efficiency of 90.63 TOPS/W. In comparison to state-of-the-art reference designs, the proposed DQ-STP demonstrates a 2.19× improvement in normalized area efficiency and a 1.85× improvement in energy efficiency.
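The two software-side techniques named in the abstract, SVD-based low-rank decomposition and 2^n (power-of-two) quantization, can be illustrated in a few lines. The sketch below is not the paper's implementation; the function names and the rank/epsilon choices are assumptions for illustration only. It shows why 2^n quantization suits a shifter-and-adder PE array: every surviving weight becomes an exact power of two, so multiplication reduces to a bit shift.

```python
import numpy as np

def low_rank_decompose(W, rank):
    # SVD-based low-rank decomposition: W ~= A @ B with A (m x r), B (r x n).
    # Truncating to the top-r singular values cuts both storage and MACs.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank, :]

def pow2_quantize(W, eps=1e-12):
    # 2^n quantization: snap each weight to the nearest power of two.
    # In hardware, multiplying by 2^k is a k-bit shift, so no multipliers
    # are needed in the PE array.
    sign = np.sign(W)
    exp = np.round(np.log2(np.abs(W) + eps))
    return sign * 2.0 ** exp

# Illustrative usage on a random weight matrix (hypothetical sizes).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
A, B = low_rank_decompose(W, rank=8)   # 64*64 -> 2*(64*8) parameters
Wq = pow2_quantize(A @ B)              # shift-friendly reconstruction
```

In this form, each factor stores far fewer parameters than the original matrix, and every entry of the quantized result is a signed power of two.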
ISSN: 1549-8328, 1558-0806
DOI: 10.1109/TCSI.2024.3364093