Exploiting criticality to reduce bottlenecks in distributed uniprocessors

Composable multicore systems merge multiple independent cores for running sequential single-threaded workloads. The performance scalability of these systems, however, is limited due to partitioning overheads. This paper addresses two of the key performance scalability limitations of composable multi...


Bibliographic details
Main authors: Robatmili, B; Govindan, S; Burger, D; Keckler, S W
Format: Conference proceeding
Language: eng
Subjects:
Online access: Order full text
container_end_page 442
container_issue
container_start_page 431
container_title
container_volume
creator Robatmili, B
Govindan, S
Burger, D
Keckler, S W
description Composable multicore systems merge multiple independent cores for running sequential single-threaded workloads. The performance scalability of these systems, however, is limited due to partitioning overheads. This paper addresses two of the key performance scalability limitations of composable multicore systems. We present a critical path analysis revealing that communication needed for cross-core register value delivery and fetch stalls due to misspeculation are the two worst bottlenecks that prevent efficient scaling to a large number of fused cores. To alleviate these bottlenecks, this paper proposes a fully distributed framework to exploit criticality in these architectures at different granularities. A coordinator core exploits different types of block-level communication criticality information to fine-tune critical instructions at decode and register forward pipeline stages of their executing cores. The framework exploits the fetch criticality information at a coarser granularity by reissuing all instructions in the blocks previously fetched into the merged cores. This general framework reduces competing bottlenecks in a synergic manner and achieves scalable performance/power efficiency for sequential programs when running across a large number of cores.
doi_str_mv 10.1109/HPCA.2011.5749749
format Conference Proceeding
fullrecord <record><control><sourceid>ieee_6IE</sourceid><recordid>TN_cdi_ieee_primary_5749749</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>5749749</ieee_id><sourcerecordid>5749749</sourcerecordid><originalsourceid>FETCH-LOGICAL-i241t-5cdc4b71a103cf6885f621003d3a77fddfb4645c10076c6ccef583874b49267b3</originalsourceid><addsrcrecordid>eNpFkEtLw0AAhNcXGGt_gHjZP5C4780eS2htoaAHhd5K9iWrMQm7G7D_3oAFh4EPZmAOA8ADRhXGSD1tX5tVRRDGFZdMzb4Ad5gRxhSjTFyCglBZlwTRw9V_QQ7XoMCcohLVSt6CZUqfaJYQCnNSgN36Z-yGkEP_AU2cadou5BPMA4zOTsZBPeTcud6ZrwRDD21IOQY9ZWfh1IcxDsalNMR0D2582yW3PHMB3jfrt2Zb7l-ed81qXwbCcC65sYZpiVuMqPGirrkXBCNELW2l9NZ6zQTjZo6kMMIY53lNa8k0U0RITRfg8W83OOeOYwzfbTwdz5fQX5YbUiw</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>Exploiting criticality to reduce bottlenecks in distributed uniprocessors</title><source>IEEE Electronic Library (IEL) Conference Proceedings</source><creator>Robatmili, B ; Govindan, S ; Burger, D ; Keckler, S W</creator><creatorcontrib>Robatmili, B ; Govindan, S ; Burger, D ; Keckler, S W</creatorcontrib><description>Composable multicore systems merge multiple independent cores for running sequential single-threaded workloads. The performance scalability of these systems, however, is limited due to partitioning overheads. This paper addresses two of the key performance scalability limitations of composable multicore systems. We present a critical path analysis revealing that communication needed for cross-core register value delivery and fetch stalls due to misspeculation are the two worst bottlenecks that prevent efficient scaling to a large number of fused cores. To alleviate these bottlenecks, this paper proposes a fully distributed framework to exploit criticality in these architectures at different granularities. 
A coordinator core exploits different types of block-level communication criticality information to fine-tune critical instructions at decode and register forward pipeline stages of their executing cores. The framework exploits the fetch criticality information at a coarser granularity by reissuing all instructions in the blocks previously fetched into the merged cores. This general framework reduces competing bottlenecks in a synergic manner and achieves scalable performance/power efficiency for sequential programs when running across a large number of cores.</description><identifier>ISSN: 1530-0897</identifier><identifier>ISBN: 142449432X</identifier><identifier>ISBN: 9781424494323</identifier><identifier>EISSN: 2378-203X</identifier><identifier>EISBN: 1424494346</identifier><identifier>EISBN: 1424494354</identifier><identifier>EISBN: 9781424494354</identifier><identifier>EISBN: 9781424494347</identifier><identifier>DOI: 10.1109/HPCA.2011.5749749</identifier><language>eng</language><publisher>IEEE</publisher><subject>Bandwidth ; Benchmark testing ; Hardware ; Microarchitecture ; Multicore processing ; Pipelines ; Registers</subject><ispartof>2011 IEEE 17th International Symposium on High Performance Computer Architecture, 2011, p.431-442</ispartof><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/5749749$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,776,780,785,786,2052,27902,54895</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/5749749$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Robatmili, B</creatorcontrib><creatorcontrib>Govindan, S</creatorcontrib><creatorcontrib>Burger, D</creatorcontrib><creatorcontrib>Keckler, S W</creatorcontrib><title>Exploiting criticality 
to reduce bottlenecks in distributed uniprocessors</title><title>2011 IEEE 17th International Symposium on High Performance Computer Architecture</title><addtitle>HPCA</addtitle><description>Composable multicore systems merge multiple independent cores for running sequential single-threaded workloads. The performance scalability of these systems, however, is limited due to partitioning overheads. This paper addresses two of the key performance scalability limitations of composable multicore systems. We present a critical path analysis revealing that communication needed for cross-core register value delivery and fetch stalls due to misspeculation are the two worst bottlenecks that prevent efficient scaling to a large number of fused cores. To alleviate these bottlenecks, this paper proposes a fully distributed framework to exploit criticality in these architectures at different granularities. A coordinator core exploits different types of block-level communication criticality information to fine-tune critical instructions at decode and register forward pipeline stages of their executing cores. The framework exploits the fetch criticality information at a coarser granularity by reissuing all instructions in the blocks previously fetched into the merged cores. 
This general framework reduces competing bottlenecks in a synergic manner and achieves scalable performance/power efficiency for sequential programs when running across a large number of cores.</description><subject>Bandwidth</subject><subject>Benchmark testing</subject><subject>Hardware</subject><subject>Microarchitecture</subject><subject>Multicore processing</subject><subject>Pipelines</subject><subject>Registers</subject><issn>1530-0897</issn><issn>2378-203X</issn><isbn>142449432X</isbn><isbn>9781424494323</isbn><isbn>1424494346</isbn><isbn>1424494354</isbn><isbn>9781424494354</isbn><isbn>9781424494347</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2011</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><sourceid>RIE</sourceid><recordid>eNpFkEtLw0AAhNcXGGt_gHjZP5C4780eS2htoaAHhd5K9iWrMQm7G7D_3oAFh4EPZmAOA8ADRhXGSD1tX5tVRRDGFZdMzb4Ad5gRxhSjTFyCglBZlwTRw9V_QQ7XoMCcohLVSt6CZUqfaJYQCnNSgN36Z-yGkEP_AU2cadou5BPMA4zOTsZBPeTcud6ZrwRDD21IOQY9ZWfh1IcxDsalNMR0D2582yW3PHMB3jfrt2Zb7l-ed81qXwbCcC65sYZpiVuMqPGirrkXBCNELW2l9NZ6zQTjZo6kMMIY53lNa8k0U0RITRfg8W83OOeOYwzfbTwdz5fQX5YbUiw</recordid><startdate>201102</startdate><enddate>201102</enddate><creator>Robatmili, B</creator><creator>Govindan, S</creator><creator>Burger, D</creator><creator>Keckler, S W</creator><general>IEEE</general><scope>6IE</scope><scope>6IL</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIL</scope></search><sort><creationdate>201102</creationdate><title>Exploiting criticality to reduce bottlenecks in distributed uniprocessors</title><author>Robatmili, B ; Govindan, S ; Burger, D ; Keckler, S 
W</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i241t-5cdc4b71a103cf6885f621003d3a77fddfb4645c10076c6ccef583874b49267b3</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2011</creationdate><topic>Bandwidth</topic><topic>Benchmark testing</topic><topic>Hardware</topic><topic>Microarchitecture</topic><topic>Multicore processing</topic><topic>Pipelines</topic><topic>Registers</topic><toplevel>online_resources</toplevel><creatorcontrib>Robatmili, B</creatorcontrib><creatorcontrib>Govindan, S</creatorcontrib><creatorcontrib>Burger, D</creatorcontrib><creatorcontrib>Keckler, S W</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Electronic Library (IEL)</collection><collection>IEEE Proceedings Order Plans (POP All) 1998-Present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Robatmili, B</au><au>Govindan, S</au><au>Burger, D</au><au>Keckler, S W</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>Exploiting criticality to reduce bottlenecks in distributed uniprocessors</atitle><btitle>2011 IEEE 17th International Symposium on High Performance Computer Architecture</btitle><stitle>HPCA</stitle><date>2011-02</date><risdate>2011</risdate><spage>431</spage><epage>442</epage><pages>431-442</pages><issn>1530-0897</issn><eissn>2378-203X</eissn><isbn>142449432X</isbn><isbn>9781424494323</isbn><eisbn>1424494346</eisbn><eisbn>1424494354</eisbn><eisbn>9781424494354</eisbn><eisbn>9781424494347</eisbn><abstract>Composable multicore systems merge multiple independent cores for running sequential single-threaded workloads. 
The performance scalability of these systems, however, is limited due to partitioning overheads. This paper addresses two of the key performance scalability limitations of composable multicore systems. We present a critical path analysis revealing that communication needed for cross-core register value delivery and fetch stalls due to misspeculation are the two worst bottlenecks that prevent efficient scaling to a large number of fused cores. To alleviate these bottlenecks, this paper proposes a fully distributed framework to exploit criticality in these architectures at different granularities. A coordinator core exploits different types of block-level communication criticality information to fine-tune critical instructions at decode and register forward pipeline stages of their executing cores. The framework exploits the fetch criticality information at a coarser granularity by reissuing all instructions in the blocks previously fetched into the merged cores. This general framework reduces competing bottlenecks in a synergic manner and achieves scalable performance/power efficiency for sequential programs when running across a large number of cores.</abstract><pub>IEEE</pub><doi>10.1109/HPCA.2011.5749749</doi><tpages>12</tpages></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 1530-0897
ispartof 2011 IEEE 17th International Symposium on High Performance Computer Architecture, 2011, p.431-442
issn 1530-0897
2378-203X
language eng
recordid cdi_ieee_primary_5749749
source IEEE Electronic Library (IEL) Conference Proceedings
subjects Bandwidth
Benchmark testing
Hardware
Microarchitecture
Multicore processing
Pipelines
Registers
title Exploiting criticality to reduce bottlenecks in distributed uniprocessors
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-07T00%3A46%3A03IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Exploiting%20criticality%20to%20reduce%20bottlenecks%20in%20distributed%20uniprocessors&rft.btitle=2011%20IEEE%2017th%20International%20Symposium%20on%20High%20Performance%20Computer%20Architecture&rft.au=Robatmili,%20B&rft.date=2011-02&rft.spage=431&rft.epage=442&rft.pages=431-442&rft.issn=1530-0897&rft.eissn=2378-203X&rft.isbn=142449432X&rft.isbn_list=9781424494323&rft_id=info:doi/10.1109/HPCA.2011.5749749&rft_dat=%3Cieee_6IE%3E5749749%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&rft.eisbn=1424494346&rft.eisbn_list=1424494354&rft.eisbn_list=9781424494354&rft.eisbn_list=9781424494347&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=5749749&rfr_iscdi=true