PULP-NN: accelerating quantized neural networks on parallel ultra-low-power RISC-V processors
We present PULP-NN, an optimized computing library for a parallel ultra-low-power tightly coupled cluster of RISC-V processors. The key innovation in PULP-NN is a set of kernels for quantized neural network inference, targeting byte and sub-byte data types, down to INT-1, tuned for the recent trend toward aggressive quantization in deep neural network inference. The proposed library exploits both the digital signal processing extensions available in the PULP RISC-V processors and the cluster's parallelism, achieving up to 15.5 MACs/cycle on INT-8 and improving performance by up to 63× with respect to a sequential implementation on a single RISC-V core implementing the baseline RV32IMC ISA. Using PULP-NN, a CIFAR-10 network on an octa-core cluster runs in 30× and 19.6× fewer clock cycles than the current state-of-the-art ARM CMSIS-NN library running on STM32L4 and STM32H7 MCUs, respectively. The proposed library, when running on a GAP-8 processor, outperforms by 36.8× and by 7.45× the execution on energy-efficient MCUs such as the STM32L4 and high-end MCUs such as the STM32H7, respectively, when operating at the maximum frequency. The energy efficiency on GAP-8 is 14.1× higher than the STM32L4 and 39.5× higher than the STM32H7, at the maximum-efficiency operating point. This article is part of the theme issue 'Harmonizing energy-autonomous computing and intelligence'.
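The throughput figures in the abstract rest on packed-SIMD sum-of-dot-product instructions in the PULP cores' DSP extensions, which retire four 8-bit multiply-accumulates per cycle. Below is a minimal, self-contained C sketch of that idea; the function names (`sdotp4_emulated`, `dot_int8`) are illustrative assumptions for this record, not PULP-NN's actual API, and the scalar loop only emulates what the hardware does in a single instruction.

```c
#include <stdint.h>
#include <stdio.h>

/* Emulation of an sdotp-style SIMD primitive: one instruction multiplies
 * four packed int8 lanes pairwise and accumulates the four products into
 * a 32-bit register. On a PULP core this is a single-cycle operation;
 * here it is plain scalar C for illustration only. */
static int32_t sdotp4_emulated(const int8_t a[4], const int8_t b[4], int32_t acc)
{
    for (int i = 0; i < 4; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

/* Inner product of two int8 vectors, consuming four lanes per step,
 * the shape of the inner loop in a quantized convolution/matmul kernel. */
static int32_t dot_int8(const int8_t *a, const int8_t *b, int n)
{
    int32_t acc = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4)
        acc = sdotp4_emulated(&a[i], &b[i], acc);
    for (; i < n; i++)  /* scalar tail when n is not a multiple of 4 */
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

int main(void)
{
    int8_t x[8] = {1, -2, 3, -4, 5, -6, 7, -8};
    int8_t w[8] = {1,  1, 1,  1, 1,  1, 1,  1};
    printf("dot = %d\n", (int)dot_int8(x, w, 8));  /* prints dot = -4 */
    return 0;
}
```

Each emulated `sdotp4` call stands for one instruction performing four 8-bit MACs; parallelizing the outer loops of such kernels across the eight cluster cores is what scales throughput toward the 15.5 MACs/cycle reported above.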
Published in: | Philosophical transactions of the Royal Society of London. Series A: Mathematical, physical, and engineering sciences, 2020-02-07, Vol. 378 (2164), p. 20190155 |
---|---|
Main authors: | Garofalo, Angelo; Rusci, Manuele; Conti, Francesco; Rossi, Davide; Benini, Luca |
Format: | Article |
Language: | English |
Online access: | Full text |
DOI: | 10.1098/rsta.2019.0155 |
PMID: | 31865877 |
Publisher: | The Royal Society Publishing, England |
ISSN: | 1364-503X |
EISSN: | 1471-2962 |