An Efficient Two-Stage Pipelined Compute-in-Memory Macro for Accelerating Transformer Feed-Forward Networks

Transformer architectures have achieved state-of-the-art performance in various applications. However, deploying transformer models on resource-constrained platforms is still challenging due to their dynamic workloads, intensive computations, and substantial memory access. In this article, we propose a two-stage pipelined compute-in-memory (CIM) macro for effectively deploying and accelerating the feed-forward network (FFN) layers of transformer models. Two independent CIM arrays are designed to execute the two distinct linear projections in FFN layers; they are interconnected by co-designed analog rectified linear unit (ReLU) circuits that realize the nonlinear activation function. The analog multiply-and-add (MAC) results from the first CIM array are streamed directly to the analog ReLU circuits, and subsequently to the second CIM array for the second linear projection. This architecture eliminates the need for analog-to-digital converters (ADCs) and digital-to-analog converters (DACs) to stage internal results, thereby enhancing overall macro efficiency and reducing computing latency. A proof-of-concept macro is fabricated in a TSMC 65-nm process and achieves 4.096 TOPS peak throughput, 4.39 TOPS/mm² area efficiency, and 49.83 TOPS/W energy efficiency. To map transformer models onto the proposed macro, we quantize the FFN layers of the BERTMINI model with per-token granularity for activations and per-tensor granularity for weights using quantization-aware training (QAT), which exhibits excellent accuracy across multiple benchmarks.
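
For context, the computation that the two CIM arrays and the analog ReLU circuits implement is the standard transformer FFN: a linear projection, a ReLU, and a second linear projection. The sketch below is a plain NumPy restatement of that computation for reference only, not the macro's analog datapath; the dimensions are illustrative assumptions loosely based on BERT-mini (hidden size 256, FFN size 1024), and the exact mapping to the macro's array sizes is not specified here.

```python
import numpy as np

def ffn_forward(x, w1, b1, w2, b2):
    # First linear projection followed by ReLU (first CIM array + analog ReLU
    # circuits in the macro), then the second linear projection (second CIM
    # array). Here everything is plain floating point.
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

# Toy dimensions loosely following BERT-mini; purely illustrative.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 256))            # 8 tokens, hidden size 256
w1 = 0.02 * rng.standard_normal((256, 1024))
b1 = np.zeros(1024)
w2 = 0.02 * rng.standard_normal((1024, 256))
b2 = np.zeros(256)
print(ffn_forward(x, w1, b1, w2, b2).shape)  # (8, 256)
```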

Bibliographic Details
Published in: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2024-10, Vol. 32 (10), p. 1889-1899
Main Authors: Zhang, Heng; Yin, Wenhe; He, Sunan; Du, Yuan; Du, Li
Format: Article
Language: English
Subjects:
Online Access: Order full text
container_end_page 1899
container_issue 10
container_start_page 1889
container_title IEEE transactions on very large scale integration (VLSI) systems
container_volume 32
creator Zhang, Heng
Yin, Wenhe
He, Sunan
Du, Yuan
Du, Li
description Transformer architectures have achieved state-of-the-art performance in various applications. However, deploying transformer models on resource-constrained platforms is still challenging due to their dynamic workloads, intensive computations, and substantial memory access. In this article, we propose a two-stage pipelined compute-in-memory (CIM) macro for effectively deploying and accelerating the feed-forward network (FFN) layers of transformer models. Two independent CIM arrays are designed to execute the two distinct linear projections in FFN layers; they are interconnected by co-designed analog rectified linear unit (ReLU) circuits that realize the nonlinear activation function. The analog multiply-and-add (MAC) results from the first CIM array are streamed directly to the analog ReLU circuits, and subsequently to the second CIM array for the second linear projection. This architecture eliminates the need for analog-to-digital converters (ADCs) and digital-to-analog converters (DACs) to stage internal results, thereby enhancing overall macro efficiency and reducing computing latency. A proof-of-concept macro is fabricated in a TSMC 65-nm process and achieves 4.096 TOPS peak throughput, 4.39 TOPS/mm² area efficiency, and 49.83 TOPS/W energy efficiency. To map transformer models onto the proposed macro, we quantize the FFN layers of the BERTMINI model with per-token granularity for activations and per-tensor granularity for weights using quantization-aware training (QAT), which exhibits excellent accuracy across multiple benchmarks.
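
The quantization granularities named in the description (per-token for activations, per-tensor for weights) can be illustrated with a minimal fake-quantization sketch. This is an assumption-laden illustration, not the paper's QAT recipe: the 8-bit width, symmetric rounding, epsilon guard, function names, and tensor shapes below are all hypothetical, and straight-through gradient handling during training is omitted.

```python
import numpy as np

def fake_quant_per_tensor(w, n_bits=8):
    # Symmetric per-tensor quantization: one scale for the whole weight tensor,
    # immediately dequantized ("fake quantization") as done during QAT.
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax + 1e-12   # epsilon guards against all-zero tensors
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def fake_quant_per_token(x, n_bits=8):
    # Symmetric per-token quantization: one scale per row (token) of the
    # activation matrix.
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(x), axis=-1, keepdims=True) / qmax + 1e-12
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

# Example: quantize activations per token and weights per tensor before a
# linear projection, mimicking the stated granularities.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 256))
w = 0.02 * rng.standard_normal((256, 1024))
y = fake_quant_per_token(x) @ fake_quant_per_tensor(w)
print(y.shape)  # (8, 1024)
```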
doi_str_mv 10.1109/TVLSI.2024.3432403
format Article
publisher New York: IEEE
coden IEVSE9
rights Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2024
orcidid 0000-0002-5316-619X
0000-0003-2687-6978
0009-0001-0664-6035
0009-0004-4604-9587
0009-0001-7827-7849
fulltext fulltext_linktorsrc
identifier ISSN: 1063-8210
ispartof IEEE transactions on very large scale integration (VLSI) systems, 2024-10, Vol.32 (10), p.1889-1899
issn 1063-8210
1557-9999
language eng
recordid cdi_proquest_journals_3110467071
source IEEE Electronic Library (IEL)
subjects Analog circuits
Analog rectified linear unit (ReLU) circuit
Analog to digital converters
Arrays
Capacitors
Circuits
Common Information Model (computing)
Computational modeling
compute-in-memory (CIM)
Computer architecture
Digital to analog converters
Efficiency
feed-forward networks (FFNs)
In-memory computing
Network latency
pipelined operation
Tensors
transformer accelerators
Transformers
title An Efficient Two-Stage Pipelined Compute-in-Memory Macro for Accelerating Transformer Feed-Forward Networks
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-27T13%3A24%3A52IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=An%20Efficient%20Two-Stage%20Pipelined%20Compute-in-Memory%20Macro%20for%20Accelerating%20Transformer%20Feed-Forward%20Networks&rft.jtitle=IEEE%20transactions%20on%20very%20large%20scale%20integration%20(VLSI)%20systems&rft.au=Zhang,%20Heng&rft.date=2024-10-01&rft.volume=32&rft.issue=10&rft.spage=1889&rft.epage=1899&rft.pages=1889-1899&rft.issn=1063-8210&rft.eissn=1557-9999&rft.coden=IEVSE9&rft_id=info:doi/10.1109/TVLSI.2024.3432403&rft_dat=%3Cproquest_RIE%3E3110467071%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3110467071&rft_id=info:pmid/&rft_ieee_id=10614387&rfr_iscdi=true