HILP: hardware-in-loop pruning of convolutional neural networks towards inference acceleration


Bibliographic details
Published in: Neural Computing & Applications, 2024-05, Vol. 36 (15), p. 8825-8842
Main authors: Li, Dong; Ye, Qianqian; Guo, Xiaoyue; Sun, Yunda; Zhang, Li
Format: Article
Language: English
Description
Summary: Successful deployment of convolutional neural networks on resource-constrained hardware platforms is challenging for ubiquitous AI applications. In latency-sensitive scenarios, real-time inference requires model compression techniques such as network pruning to accelerate inference. However, many studies focus on hardware-independent filter pruning methods, which cannot balance the latency contribution of a pruned structure against the accuracy drop it causes. Although some pruning methods have introduced latency constraints into the pruning process, most of them rely on look-up tables, which omit the key step of hardware optimization and therefore deviate significantly in their latency estimates. In this paper, we propose a novel latency-constrained pruning method, named hardware-in-loop pruning (HILP). It is based on a fast search for the optimal pruning rate within each layer and on layer-wise hybrid pruning, which prioritizes removing the less important layers that contribute considerably to latency. The proposed hardware-in-loop pipeline integrates the hardware optimization module into the entire framework. During pruning, each intermediate network architecture is automatically transformed into a deployable model for accurate latency measurement. The latency-optimized intermediate architecture is then selected by traversing all layers and is used for the next progressive step. HILP is generally applicable to any platform that provides a hardware optimization toolchain, such as NVIDIA GPU and Cambricon NPU. We evaluate HILP on an image classification task using ResNet50 with ImageNet and on an object detection task using YOLOv3 with COCO. HILP reduces the inference latency of these two networks to 60% and 75%, respectively, with accuracy variation not exceeding 0.6%. Extensive experimental results show that HILP achieves a significant advantage in latency-accuracy performance compared to state-of-the-art methods.
ISSN: 0941-0643, 1433-3058
DOI: 10.1007/s00521-024-09539-8
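
Illustrative sketch
The summary describes a progressive loop: prune each layer in turn, measure the latency of the resulting candidate on the target hardware, keep the best candidate, and repeat. The Python sketch below is one possible reading of that loop and not the authors' implementation. Every name in it is hypothetical, the scoring rule (latency saved per unit of importance lost) is an assumption, and measure_latency is a toy stand-in for the real step of compiling a candidate with a hardware toolchain and timing it on the device.

```python
from typing import Callable, Dict, List, Optional, Tuple

Arch = Dict[str, int]  # hypothetical representation: layer name -> remaining filters


def hardware_in_loop_prune(
    arch: Arch,
    measure_latency: Callable[[Arch], float],
    importance_lost: Callable[[str, int, int], float],
    target_latency: float,
    rates: Tuple[float, ...] = (0.25, 0.5),
    max_steps: int = 100,
) -> Tuple[Arch, List[float]]:
    """Greedy progressive pruning driven by measured latency (assumed scheme)."""
    arch = dict(arch)  # work on a copy so the caller's architecture is untouched
    history = [measure_latency(arch)]
    for _ in range(max_steps):
        base = history[-1]
        if base <= target_latency:
            break
        best: Optional[Tuple[Arch, float]] = None
        best_score = 0.0
        # Traverse all layers and, within each layer, a few candidate pruning rates.
        for layer, filters in arch.items():
            for rate in rates:
                kept = max(1, int(filters * (1 - rate)))
                if kept == filters:
                    continue  # nothing left to remove at this rate
                trial = dict(arch)
                trial[layer] = kept
                latency = measure_latency(trial)  # stand-in for on-device timing
                saved = base - latency
                if saved <= 0:
                    continue
                # Assumed criterion: latency saved per unit of importance removed.
                score = saved / (importance_lost(layer, filters, kept) + 1e-12)
                if score > best_score:
                    best, best_score = (trial, latency), score
        if best is None:
            break  # no candidate reduces latency any further
        arch, latency = best
        history.append(latency)
    return arch, history


# Toy stand-ins (assumptions): linear per-filter cost plus a fixed overhead.
COST_MS = {"conv1": 0.02, "conv2": 0.03, "conv3": 0.01}
WEIGHT = {"conv1": 3.0, "conv2": 1.0, "conv3": 2.0}


def toy_latency(a: Arch) -> float:
    return 1.0 + sum(COST_MS[name] * n for name, n in a.items())


def toy_importance_lost(layer: str, before: int, after: int) -> float:
    return WEIGHT[layer] * (before - after) / before


start_arch = {"conv1": 64, "conv2": 128, "conv3": 256}
target = 0.6 * toy_latency(start_arch)  # arbitrary target chosen for the demo
final_arch, history = hardware_in_loop_prune(
    start_arch, toy_latency, toy_importance_lost, target
)
print(final_arch, [round(t, 2) for t in history])
```

In a real setting, measure_latency would export the candidate, run it through the platform's optimization toolchain, and benchmark the deployable model, which is the step the summary says look-up-table methods skip. The toy functions here only make the loop runnable.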