Custom Multicache Architectures for Heap Manipulating Programs

Memory-intensive implementations often require access to an external, off-chip memory which can substantially slow down a field-programmable gate array accelerator due to memory bandwidth limitations. Buffering frequently reused data on chip is a common approach to address this problem and the opti...

Bibliographic details
Published in:	IEEE transactions on computer-aided design of integrated circuits and systems 2017-05, Vol.36 (5), p.761-774
Main authors:	Winterstein, Felix; Fleming, Kermin E.; Hsin-Jung Yang; Constantinides, George A.
Format:	Article
Language:	eng
Subjects:	Bandwidth; Caching schemes; Computer architecture; Computer memory; Data structures; Design optimization; dynamic data structures; Field programmable gate arrays; field-programmable gate array (FPGA); high-level synthesis (HLS); Level (quantity); Memory management; memory system; Optimization; Program verification (computers); separation logic; Software engineering; Synchronism; System-on-chip
Online access:	Order full text

container_end_page	774
container_issue	5
container_start_page	761
container_title	IEEE transactions on computer-aided design of integrated circuits and systems
container_volume	36
creator	Winterstein, Felix; Fleming, Kermin E.; Hsin-Jung Yang; Constantinides, George A.
description	Memory-intensive implementations often require access to an external, off-chip memory which can substantially slow down a field-programmable gate array accelerator due to memory bandwidth limitations. Buffering frequently reused data on chip is a common approach to address this problem and the optimization of the cache architecture introduces yet another complex design space. This paper presents a high-level synthesis (HLS) design aid that automatically generates parallel multicache systems which are tailored to the specific requirements of the application. Our program analysis identifies nonoverlapping memory regions, supported by private caches, and regions which are shared by parallel units after parallelization, which are supported by coherent caches and synchronization primitives. It also decides whether the parallelization is legal with respect to data dependencies. The novelty of this paper is the focus on programs using dynamically allocated, pointer-based data structures which, while common in software engineering, remain difficult to analyze and are beyond the scope of the overwhelming majority of HLS techniques to date. Second, we devise a high-level cache performance estimation to find a heterogeneous configuration of cache sizes that maximizes the performance of the multicache system subject to an on-chip memory resource constraint. We demonstrate our technique with three case studies of applications using dynamic data structures and use Xilinx Vivado HLS as an exemplary HLS tool. We show up to 15× speedup after parallelization of the HLS implementations and the insertion of the application-specific distributed hybrid multicache architecture.
doi_str_mv	10.1109/TCAD.2016.2608861
format	Article
coden	ITCSDI
collection	IEEE All-Society Periodicals Package (ASPP) 2005-present; IEEE All-Society Periodicals Package (ASPP) 1998-Present; IEEE Electronic Library (IEL); CrossRef; Computer and Information Systems Abstracts; Electronics & Communications Abstracts; Technology Research Database; ProQuest Computer Science Collection; Advanced Technologies Database with Aerospace; Computer and Information Systems Abstracts Academic; Computer and Information Systems Abstracts Professional
date	2017-05-01
eissn	1937-4151
ieee_id	7565624
linktohtml	https://ieeexplore.ieee.org/document/7565624
oa	free_for_read
orcid	https://orcid.org/0000-0002-2525-0693
pqid	1891107744
publisher	New York: IEEE
rights	Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2017
stitle	TCAD
toplevel	peer_reviewed; online_resources
tpages	14
fulltext	fulltext_linktorsrc
identifier	ISSN: 0278-0070
ispartof	IEEE transactions on computer-aided design of integrated circuits and systems, 2017-05, Vol.36 (5), p.761-774
issn	0278-0070 1937-4151
language	eng
recordid	cdi_ieee_primary_7565624
source	IEEE Electronic Library (IEL)
subjects	Bandwidth; Caching schemes; Computer architecture; Computer memory; Data structures; Design optimization; dynamic data structures; Field programmable gate arrays; field-programmable gate array (FPGA); high-level synthesis (HLS); Level (quantity); Memory management; memory system; Optimization; Program verification (computers); separation logic; Software engineering; Synchronism; System-on-chip
title	Custom Multicache Architectures for Heap Manipulating Programs
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-09T11%3A03%3A52IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Custom%20Multicache%20Architectures%20for%20Heap%20Manipulating%20Programs&rft.jtitle=IEEE%20transactions%20on%20computer-aided%20design%20of%20integrated%20circuits%20and%20systems&rft.au=Winterstein,%20Felix&rft.date=2017-05-01&rft.volume=36&rft.issue=5&rft.spage=761&rft.epage=774&rft.pages=761-774&rft.issn=0278-0070&rft.eissn=1937-4151&rft.coden=ITCSDI&rft_id=info:doi/10.1109/TCAD.2016.2608861&rft_dat=%3Cproquest_RIE%3E1891107744%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1891107744&rft_id=info:pmid/&rft_ieee_id=7565624&rfr_iscdi=true