BitBlade: Energy-Efficient Variable Bit-Precision Hardware Accelerator for Quantized Neural Networks

We introduce an area- and energy-efficient precision-scalable neural network accelerator architecture. Previous precision-scalable hardware accelerators have limitations such as under-utilization of multipliers for low bit-width operations and large area overhead to support various bit precisions. To mitigate these problems, we first propose a bitwise summation scheme, which reduces the area overhead of bit-width scaling. In addition, we present a channel-wise aligning scheme (CAS) to efficiently fetch inputs and weights from on-chip SRAM buffers, and a channel-first and pixel-last tiling (CFPL) scheme to maximize the utilization of multipliers across various kernel sizes. A test chip was implemented in 28-nm CMOS technology, and the experimental results show that the throughput and energy efficiency of our chip are up to 7.7× and 1.64× higher, respectively, than those of state-of-the-art designs. Moreover, an additional 1.5–3.4× throughput gain can be achieved using the CFPL method compared to the CAS.
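
The abstract describes the datapath only at a high level, so the following is a minimal Python sketch of the general precision-scalable multiplication idea it builds on: a multi-bit multiply is decomposed into 2-bit partial products, and partial products of equal significance are summed before a single shift, the kind of restructuring a bitwise-summation scheme can exploit to avoid a full shift-add per low-precision multiplier. The function names (split_2bit, scalable_multiply) and the software-level formulation are illustrative assumptions, not the paper's actual hardware.

    # Sketch only (not the paper's RTL): decompose an unsigned N-bit x M-bit
    # multiply into 2-bit x 2-bit partial products, then sum all partial
    # products that share the same significance BEFORE applying the shift,
    # so only one shift-and-add is needed per significance group.

    def split_2bit(value: int, bits: int):
        """Split an unsigned `bits`-wide integer into 2-bit slices, LSB first."""
        assert bits % 2 == 0 and 0 <= value < (1 << bits)
        return [(value >> (2 * i)) & 0b11 for i in range(bits // 2)]

    def scalable_multiply(a: int, b: int, a_bits: int, b_bits: int) -> int:
        """Multiply two unsigned integers of configurable (even) bit-widths."""
        sums_per_shift = {}  # shift amount -> running sum of 2-bit x 2-bit products
        for i, a2 in enumerate(split_2bit(a, a_bits)):
            for j, b2 in enumerate(split_2bit(b, b_bits)):
                shift = 2 * (i + j)  # significance of this partial product
                sums_per_shift[shift] = sums_per_shift.get(shift, 0) + a2 * b2
        # One shift per group of summed partial products, not per product.
        return sum(partial << shift for shift, partial in sums_per_shift.items())

    if __name__ == "__main__":
        for a, b, a_bits, b_bits in [(200, 117, 8, 8), (13, 7, 4, 4), (3, 2, 2, 2)]:
            assert scalable_multiply(a, b, a_bits, b_bits) == a * b
        print("2-bit decomposition with grouped summation matches exact products")

In designs of this style, the same grouping is what lets a fixed pool of 2-bit multipliers be reconfigured to serve 2-, 4-, or 8-bit operands without idling, which is the under-utilization issue the abstract points to.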

Bibliographic Details

Published in: IEEE Journal of Solid-State Circuits, 2022-06, Vol. 57 (6), pp. 1924-1935
Main Authors: Ryu, Sungju; Kim, Hyungjun; Yi, Wooseok; Kim, Eunhwan; Kim, Yulhwa; Kim, Taesu; Kim, Jae-Joon
Format: Article
Language: English
Publisher: IEEE, New York
DOI: 10.1109/JSSC.2022.3141050
ISSN: 0018-9200
EISSN: 1558-173X
Source: IEEE Electronic Library (IEL)
Subjects: Adders; Arrays; Bit-precision scaling; bitwise summation; channel-first and pixel-last tiling (CFPL); channel-wise aligning; Computer architecture; deep neural network; Energy efficiency; Hardware; Hardware acceleration; hardware accelerator; Multipliers; multiply–accumulate unit; Neural networks; Random access memory; Throughput; Tiling