Exploiting criticality to reduce bottlenecks in distributed uniprocessors

Composable multicore systems merge multiple independent cores for running sequential single-threaded workloads. The performance scalability of these systems, however, is limited due to partitioning overheads. This paper addresses two of the key performance scalability limitations of composable multi...

Bibliographic details
Main authors:	Robatmili, B; Govindan, S; Burger, D; Keckler, S W
Format:	Conference proceeding
Language:	eng
Subjects:	Bandwidth; Benchmark testing; Hardware; Microarchitecture; Multicore processing; Pipelines; Registers
Online access:	Order full text

container_end_page	442
container_issue
container_start_page	431
container_title
container_volume
creator	Robatmili, B; Govindan, S; Burger, D; Keckler, S W
description	Composable multicore systems merge multiple independent cores for running sequential single-threaded workloads. The performance scalability of these systems, however, is limited due to partitioning overheads. This paper addresses two of the key performance scalability limitations of composable multicore systems. We present a critical path analysis revealing that communication needed for cross-core register value delivery and fetch stalls due to misspeculation are the two worst bottlenecks that prevent efficient scaling to a large number of fused cores. To alleviate these bottlenecks, this paper proposes a fully distributed framework to exploit criticality in these architectures at different granularities. A coordinator core exploits different types of block-level communication criticality information to fine-tune critical instructions at decode and register forward pipeline stages of their executing cores. The framework exploits the fetch criticality information at a coarser granularity by reissuing all instructions in the blocks previously fetched into the merged cores. This general framework reduces competing bottlenecks in a synergic manner and achieves scalable performance/power efficiency for sequential programs when running across a large number of cores.
doi_str_mv	10.1109/HPCA.2011.5749749
format	Conference Proceeding
fullrecord	<record><control><sourceid>ieee_6IE</sourceid><recordid>TN_cdi_ieee_primary_5749749</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>5749749</ieee_id><sourcerecordid>5749749</sourcerecordid><originalsourceid>FETCH-LOGICAL-i241t-5cdc4b71a103cf6885f621003d3a77fddfb4645c10076c6ccef583874b49267b3</originalsourceid><addsrcrecordid>eNpFkEtLw0AAhNcXGGt_gHjZP5C4780eS2htoaAHhd5K9iWrMQm7G7D_3oAFh4EPZmAOA8ADRhXGSD1tX5tVRRDGFZdMzb4Ad5gRxhSjTFyCglBZlwTRw9V_QQ7XoMCcohLVSt6CZUqfaJYQCnNSgN36Z-yGkEP_AU2cadou5BPMA4zOTsZBPeTcud6ZrwRDD21IOQY9ZWfh1IcxDsalNMR0D2582yW3PHMB3jfrt2Zb7l-ed81qXwbCcC65sYZpiVuMqPGirrkXBCNELW2l9NZ6zQTjZo6kMMIY53lNa8k0U0RITRfg8W83OOeOYwzfbTwdz5fQX5YbUiw</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>Exploiting criticality to reduce bottlenecks in distributed uniprocessors</title><source>IEEE Electronic Library (IEL) Conference Proceedings</source><creator>Robatmili, B ; Govindan, S ; Burger, D ; Keckler, S W</creator><creatorcontrib>Robatmili, B ; Govindan, S ; Burger, D ; Keckler, S W</creatorcontrib><description>Composable multicore systems merge multiple independent cores for running sequential single-threaded workloads. The performance scalability of these systems, however, is limited due to partitioning overheads. This paper addresses two of the key performance scalability limitations of composable multicore systems. We present a critical path analysis revealing that communication needed for cross-core register value delivery and fetch stalls due to misspeculation are the two worst bottlenecks that prevent efficient scaling to a large number of fused cores. To alleviate these bottlenecks, this paper proposes a fully distributed framework to exploit criticality in these architectures at different granularities. 
A coordinator core exploits different types of block-level communication criticality information to fine-tune critical instructions at decode and register forward pipeline stages of their executing cores. The framework exploits the fetch criticality information at a coarser granularity by reissuing all instructions in the blocks previously fetched into the merged cores. This general framework reduces competing bottlenecks in a synergic manner and achieves scalable performance/power efficiency for sequential programs when running across a large number of cores.</description><identifier>ISSN: 1530-0897</identifier><identifier>ISBN: 142449432X</identifier><identifier>ISBN: 9781424494323</identifier><identifier>EISSN: 2378-203X</identifier><identifier>EISBN: 1424494346</identifier><identifier>EISBN: 1424494354</identifier><identifier>EISBN: 9781424494354</identifier><identifier>EISBN: 9781424494347</identifier><identifier>DOI: 10.1109/HPCA.2011.5749749</identifier><language>eng</language><publisher>IEEE</publisher><subject>Bandwidth ; Benchmark testing ; Hardware ; Microarchitecture ; Multicore processing ; Pipelines ; Registers</subject><ispartof>2011 IEEE 17th International Symposium on High Performance Computer Architecture, 2011, p.431-442</ispartof><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/5749749$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,776,780,785,786,2052,27902,54895</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/5749749$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Robatmili, B</creatorcontrib><creatorcontrib>Govindan, S</creatorcontrib><creatorcontrib>Burger, D</creatorcontrib><creatorcontrib>Keckler, S W</creatorcontrib><title>Exploiting criticality 
to reduce bottlenecks in distributed uniprocessors</title><title>2011 IEEE 17th International Symposium on High Performance Computer Architecture</title><addtitle>HPCA</addtitle><description>Composable multicore systems merge multiple independent cores for running sequential single-threaded workloads. The performance scalability of these systems, however, is limited due to partitioning overheads. This paper addresses two of the key performance scalability limitations of composable multicore systems. We present a critical path analysis revealing that communication needed for cross-core register value delivery and fetch stalls due to misspeculation are the two worst bottlenecks that prevent efficient scaling to a large number of fused cores. To alleviate these bottlenecks, this paper proposes a fully distributed framework to exploit criticality in these architectures at different granularities. A coordinator core exploits different types of block-level communication criticality information to fine-tune critical instructions at decode and register forward pipeline stages of their executing cores. The framework exploits the fetch criticality information at a coarser granularity by reissuing all instructions in the blocks previously fetched into the merged cores. 
This general framework reduces competing bottlenecks in a synergic manner and achieves scalable performance/power efficiency for sequential programs when running across a large number of cores.</description><subject>Bandwidth</subject><subject>Benchmark testing</subject><subject>Hardware</subject><subject>Microarchitecture</subject><subject>Multicore processing</subject><subject>Pipelines</subject><subject>Registers</subject><issn>1530-0897</issn><issn>2378-203X</issn><isbn>142449432X</isbn><isbn>9781424494323</isbn><isbn>1424494346</isbn><isbn>1424494354</isbn><isbn>9781424494354</isbn><isbn>9781424494347</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2011</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><sourceid>RIE</sourceid><recordid>eNpFkEtLw0AAhNcXGGt_gHjZP5C4780eS2htoaAHhd5K9iWrMQm7G7D_3oAFh4EPZmAOA8ADRhXGSD1tX5tVRRDGFZdMzb4Ad5gRxhSjTFyCglBZlwTRw9V_QQ7XoMCcohLVSt6CZUqfaJYQCnNSgN36Z-yGkEP_AU2cadou5BPMA4zOTsZBPeTcud6ZrwRDD21IOQY9ZWfh1IcxDsalNMR0D2582yW3PHMB3jfrt2Zb7l-ed81qXwbCcC65sYZpiVuMqPGirrkXBCNELW2l9NZ6zQTjZo6kMMIY53lNa8k0U0RITRfg8W83OOeOYwzfbTwdz5fQX5YbUiw</recordid><startdate>201102</startdate><enddate>201102</enddate><creator>Robatmili, B</creator><creator>Govindan, S</creator><creator>Burger, D</creator><creator>Keckler, S W</creator><general>IEEE</general><scope>6IE</scope><scope>6IL</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIL</scope></search><sort><creationdate>201102</creationdate><title>Exploiting criticality to reduce bottlenecks in distributed uniprocessors</title><author>Robatmili, B ; Govindan, S ; Burger, D ; Keckler, S 
W</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i241t-5cdc4b71a103cf6885f621003d3a77fddfb4645c10076c6ccef583874b49267b3</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2011</creationdate><topic>Bandwidth</topic><topic>Benchmark testing</topic><topic>Hardware</topic><topic>Microarchitecture</topic><topic>Multicore processing</topic><topic>Pipelines</topic><topic>Registers</topic><toplevel>online_resources</toplevel><creatorcontrib>Robatmili, B</creatorcontrib><creatorcontrib>Govindan, S</creatorcontrib><creatorcontrib>Burger, D</creatorcontrib><creatorcontrib>Keckler, S W</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Electronic Library (IEL)</collection><collection>IEEE Proceedings Order Plans (POP All) 1998-Present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Robatmili, B</au><au>Govindan, S</au><au>Burger, D</au><au>Keckler, S W</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>Exploiting criticality to reduce bottlenecks in distributed uniprocessors</atitle><btitle>2011 IEEE 17th International Symposium on High Performance Computer Architecture</btitle><stitle>HPCA</stitle><date>2011-02</date><risdate>2011</risdate><spage>431</spage><epage>442</epage><pages>431-442</pages><issn>1530-0897</issn><eissn>2378-203X</eissn><isbn>142449432X</isbn><isbn>9781424494323</isbn><eisbn>1424494346</eisbn><eisbn>1424494354</eisbn><eisbn>9781424494354</eisbn><eisbn>9781424494347</eisbn><abstract>Composable multicore systems merge multiple independent cores for running sequential single-threaded workloads. 
The performance scalability of these systems, however, is limited due to partitioning overheads. This paper addresses two of the key performance scalability limitations of composable multicore systems. We present a critical path analysis revealing that communication needed for cross-core register value delivery and fetch stalls due to misspeculation are the two worst bottlenecks that prevent efficient scaling to a large number of fused cores. To alleviate these bottlenecks, this paper proposes a fully distributed framework to exploit criticality in these architectures at different granularities. A coordinator core exploits different types of block-level communication criticality information to fine-tune critical instructions at decode and register forward pipeline stages of their executing cores. The framework exploits the fetch criticality information at a coarser granularity by reissuing all instructions in the blocks previously fetched into the merged cores. This general framework reduces competing bottlenecks in a synergic manner and achieves scalable performance/power efficiency for sequential programs when running across a large number of cores.</abstract><pub>IEEE</pub><doi>10.1109/HPCA.2011.5749749</doi><tpages>12</tpages></addata></record>
fulltext	fulltext_linktorsrc
identifier	ISSN: 1530-0897
ispartof	2011 IEEE 17th International Symposium on High Performance Computer Architecture, 2011, p.431-442
issn	1530-0897 2378-203X
language	eng
recordid	cdi_ieee_primary_5749749
source	IEEE Electronic Library (IEL) Conference Proceedings
subjects	Bandwidth; Benchmark testing; Hardware; Microarchitecture; Multicore processing; Pipelines; Registers
title	Exploiting criticality to reduce bottlenecks in distributed uniprocessors
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-07T00%3A46%3A03IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Exploiting%20criticality%20to%20reduce%20bottlenecks%20in%20distributed%20uniprocessors&rft.btitle=2011%20IEEE%2017th%20International%20Symposium%20on%20High%20Performance%20Computer%20Architecture&rft.au=Robatmili,%20B&rft.date=2011-02&rft.spage=431&rft.epage=442&rft.pages=431-442&rft.issn=1530-0897&rft.eissn=2378-203X&rft.isbn=142449432X&rft.isbn_list=9781424494323&rft_id=info:doi/10.1109/HPCA.2011.5749749&rft_dat=%3Cieee_6IE%3E5749749%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&rft.eisbn=1424494346&rft.eisbn_list=1424494354&rft.eisbn_list=9781424494354&rft.eisbn_list=9781424494347&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=5749749&rfr_iscdi=true