Generation of Customized Accelerators for Loop Pipelining of Binary Instruction Traces

Many embedded applications process large amounts of data using regular computational kernels, amenable to acceleration by specialized hardware coprocessors. To reduce the significant design effort, the dedicated hardware may be automatically generated, usually starting from the application's so...

Bibliographic details
Published in:	IEEE transactions on very large scale integration (VLSI) systems 2017-01, Vol.25 (1), p.21-34
Main authors:	Paulino, Nuno M. C.; Canas Ferreira, Joao; Cardoso, Joao M. P.
Format:	Article
Language:	English
Subjects:	Acceleration; Accelerators; Binary acceleration; Binary codes; Computer architecture; coprocessor; Coprocessors; Customization; Field programmable gate arrays; field-programmable gate array (FPGA); Floating point arithmetic; Hardware; Integers; Kernel; Kernels; loop accelerator; Microprocessors; modulo-scheduling; Parallel processing; Pipelining (computers); Registers; Runtime; very long instruction word (VLIW); VLIW

container_end_page	34
container_issue	1
container_start_page	21
container_title	IEEE transactions on very large scale integration (VLSI) systems
container_volume	25
creator	Paulino, Nuno M. C.; Canas Ferreira, Joao; Cardoso, Joao M. P.
description	Many embedded applications process large amounts of data using regular computational kernels, amenable to acceleration by specialized hardware coprocessors. To reduce the significant design effort, the dedicated hardware may be automatically generated, usually starting from the application's source or binary code. This paper presents a modulo-scheduled loop accelerator capable of executing multiple loops and a supporting toolchain. A generation/scheduling procedure, which fully relies on MicroBlaze instruction traces, produces accelerator instances, customized in terms of functional units and interconnections. The accelerators support integer and single-precision floating-point arithmetic, and exploit instruction-level parallelism, loop pipelining, and memory access parallelism via two read/write ports. A complete implementation of the proposed architecture is evaluated in a Virtex-7 device. Augmenting a MicroBlaze processor with a tailored accelerator achieves a geometric mean speedup, over software-only execution, of 6.61× for 13 floating-point kernels from the Livermore Loops set, and of 4.08× for 11 integer kernels from Texas Instruments' IMGLIB. The proposed customized accelerators are compared with ALU-based ones. The average specialized accelerator requires only 0.47× the number of field-programmable gate array slices of an accelerator with four ALUs. A geometric mean speedup of 1.78× over a four-issue very long instruction word (without floating-point support) was obtained for the integer kernels.
doi_str_mv	10.1109/TVLSI.2016.2573640
format	Article
fulltext	fulltext_linktorsrc
identifier	ISSN: 1063-8210
ispartof	IEEE transactions on very large scale integration (VLSI) systems, 2017-01, Vol.25 (1), p.21-34
issn	1063-8210 1557-9999
language	eng
recordid	cdi_ieee_primary_7506263
source	IEEE Electronic Library (IEL)
subjects	Acceleration; Accelerators; Binary acceleration; Binary codes; Computer architecture; coprocessor; Coprocessors; Customization; Field programmable gate arrays; field-programmable gate array (FPGA); Floating point arithmetic; Hardware; Integers; Kernel; Kernels; loop accelerator; Microprocessors; modulo-scheduling; Parallel processing; Pipelining (computers); Registers; Runtime; very long instruction word (VLIW); VLIW
title	Generation of Customized Accelerators for Loop Pipelining of Binary Instruction Traces
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-26T06%3A43%3A19IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Generation%20of%20Customized%20Accelerators%20for%20Loop%20Pipelining%20of%20Binary%20Instruction%20Traces&rft.jtitle=IEEE%20transactions%20on%20very%20large%20scale%20integration%20(VLSI)%20systems&rft.au=Paulino,%20Nuno%20M.%20C.&rft.date=2017-01&rft.volume=25&rft.issue=1&rft.spage=21&rft.epage=34&rft.pages=21-34&rft.issn=1063-8210&rft.eissn=1557-9999&rft.coden=IEVSE9&rft_id=info:doi/10.1109/TVLSI.2016.2573640&rft_dat=%3Cproquest_RIE%3E1855465987%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1855465987&rft_id=info:pmid/&rft_ieee_id=7506263&rfr_iscdi=true