Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors



Bibliographic Details
Main Authors: San Juan-Sebastian, Pablo; Rodríguez-Sánchez, Rafael; Igual, Francisco D.; Alonso-Jordá, Pedro; Quintana-Ortí, Enrique S.
Format: Article
Language: English
Description
Summary: [EN] We introduce a high-performance, multi-threaded realization of the gemm kernel for the ARMv8.2 architecture that operates with 16-bit (half-precision) floating point operands. Our code is especially designed for efficient machine learning inference (and, to a certain extent, also training) with deep neural networks. The results on the NVIDIA Carmel multicore processor, which implements the ARMv8.2 architecture, show considerable performance gains for the gemm kernel, close to the theoretical peak acceleration that could be expected when moving from 32-bit to 16-bit arithmetic/data. Combined with the type of convolution operator arising in convolutional neural networks, the speed-ups are more modest, though still relevant.

This work was supported by projects TIN2017-82972-R and RTI2018-093684-B-I00 from the Ministerio de Ciencia, Innovación y Universidades, project S2018/TCS-4423 of the Comunidad de Madrid, project PR65/19-22445 of the UCM, and project Prometeo/2019/109 of the Generalitat Valenciana.

Citation: San Juan-Sebastian, P.; Rodríguez-Sánchez, R.; Igual, F. D.; Alonso-Jordá, P.; Quintana-Ortí, E. S. (2021). Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors. The Journal of Supercomputing 77(10):11257-11269. https://doi.org/10.1007/s11227-021-03636-4
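To make the 16-bit approach concrete, the following is a minimal sketch of what an FP16 gemm micro-kernel can look like on ARMv8.2 using NEON half-precision intrinsics. This is an illustration under stated assumptions, not the paper's actual kernel: the function name, the 8x4 micro-tile size, and the BLIS-style packed operand layout are all choices made for this sketch.

/*
 * Illustrative 8x4 FP16 micro-kernel in the BLIS/GotoBLAS style
 * (hypothetical, not taken from the paper). Requires the ARMv8.2-A
 * half-precision vector extension; compile with, e.g.,
 * -march=armv8.2-a+fp16.
 *
 * Assumed packed layouts: A holds K micro-columns of 8 contiguous fp16
 * values, B holds K micro-rows of 4 contiguous fp16 values; C is stored
 * column-major with leading dimension ldc.
 */
#include <arm_neon.h>

void gemm_ukernel_8x4_fp16(int K, const float16_t *A, const float16_t *B,
                           float16_t *C, int ldc)
{
    /* Keep the 8x4 micro-tile of C in four 8-lane fp16 vector registers. */
    float16x8_t c0 = vld1q_f16(&C[0 * ldc]);
    float16x8_t c1 = vld1q_f16(&C[1 * ldc]);
    float16x8_t c2 = vld1q_f16(&C[2 * ldc]);
    float16x8_t c3 = vld1q_f16(&C[3 * ldc]);

    for (int k = 0; k < K; ++k) {
        /* Rank-1 update: an 8-element column of A times a 4-element row of B. */
        float16x8_t a = vld1q_f16(&A[8 * k]);
        c0 = vfmaq_n_f16(c0, a, B[4 * k + 0]);  /* c_j += a * b(k, j) */
        c1 = vfmaq_n_f16(c1, a, B[4 * k + 1]);
        c2 = vfmaq_n_f16(c2, a, B[4 * k + 2]);
        c3 = vfmaq_n_f16(c3, a, B[4 * k + 3]);
    }

    vst1q_f16(&C[0 * ldc], c0);
    vst1q_f16(&C[1 * ldc], c1);
    vst1q_f16(&C[2 * ldc], c2);
    vst1q_f16(&C[3 * ldc], c3);
}

Each vfmaq_n_f16 performs eight half-precision fused multiply-adds, twice the lanes of its 32-bit counterpart vfmaq_n_f32, which is the source of the near-2x theoretical peak acceleration the abstract refers to. The convolution case gains less because lowering a convolution to gemm (e.g., via im2col) adds packing work and memory traffic that narrower arithmetic does not remove.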