Exploiting criticality to reduce bottlenecks in distributed uniprocessors
Composable multicore systems merge multiple independent cores for running sequential single-threaded workloads. The performance scalability of these systems, however, is limited due to partitioning overheads. This paper addresses two of the key performance scalability limitations of composable multi...
Main Authors: | Robatmili, B ; Govindan, S ; Burger, D ; Keckler, S W |
---|---|
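Format: | Conference Proceeding |
Language: | eng |
Subjects: | Bandwidth ; Benchmark testing ; Hardware ; Microarchitecture ; Multicore processing ; Pipelines ; Registers |
Online Access: | Order full text |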
container_end_page | 442 |
---|---|
container_issue | |
container_start_page | 431 |
container_title | |
container_volume | |
creator | Robatmili, B ; Govindan, S ; Burger, D ; Keckler, S W |
description | Composable multicore systems merge multiple independent cores for running sequential single-threaded workloads. The performance scalability of these systems, however, is limited due to partitioning overheads. This paper addresses two of the key performance scalability limitations of composable multicore systems. We present a critical path analysis revealing that communication needed for cross-core register value delivery and fetch stalls due to misspeculation are the two worst bottlenecks that prevent efficient scaling to a large number of fused cores. To alleviate these bottlenecks, this paper proposes a fully distributed framework to exploit criticality in these architectures at different granularities. A coordinator core exploits different types of block-level communication criticality information to fine-tune critical instructions at decode and register forward pipeline stages of their executing cores. The framework exploits the fetch criticality information at a coarser granularity by reissuing all instructions in the blocks previously fetched into the merged cores. This general framework reduces competing bottlenecks in a synergic manner and achieves scalable performance/power efficiency for sequential programs when running across a large number of cores. |
doi_str_mv | 10.1109/HPCA.2011.5749749 |
format | Conference Proceeding |
fullrecord | <record><control><sourceid>ieee_6IE</sourceid><recordid>TN_cdi_ieee_primary_5749749</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>5749749</ieee_id><sourcerecordid>5749749</sourcerecordid><originalsourceid>FETCH-LOGICAL-i241t-5cdc4b71a103cf6885f621003d3a77fddfb4645c10076c6ccef583874b49267b3</originalsourceid><addsrcrecordid>eNpFkEtLw0AAhNcXGGt_gHjZP5C4780eS2htoaAHhd5K9iWrMQm7G7D_3oAFh4EPZmAOA8ADRhXGSD1tX5tVRRDGFZdMzb4Ad5gRxhSjTFyCglBZlwTRw9V_QQ7XoMCcohLVSt6CZUqfaJYQCnNSgN36Z-yGkEP_AU2cadou5BPMA4zOTsZBPeTcud6ZrwRDD21IOQY9ZWfh1IcxDsalNMR0D2582yW3PHMB3jfrt2Zb7l-ed81qXwbCcC65sYZpiVuMqPGirrkXBCNELW2l9NZ6zQTjZo6kMMIY53lNa8k0U0RITRfg8W83OOeOYwzfbTwdz5fQX5YbUiw</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>Exploiting criticality to reduce bottlenecks in distributed uniprocessors</title><source>IEEE Electronic Library (IEL) Conference Proceedings</source><creator>Robatmili, B ; Govindan, S ; Burger, D ; Keckler, S W</creator><creatorcontrib>Robatmili, B ; Govindan, S ; Burger, D ; Keckler, S W</creatorcontrib><description>Composable multicore systems merge multiple independent cores for running sequential single-threaded workloads. The performance scalability of these systems, however, is limited due to partitioning overheads. This paper addresses two of the key performance scalability limitations of composable multicore systems. We present a critical path analysis revealing that communication needed for cross-core register value delivery and fetch stalls due to misspeculation are the two worst bottlenecks that prevent efficient scaling to a large number of fused cores. To alleviate these bottlenecks, this paper proposes a fully distributed framework to exploit criticality in these architectures at different granularities. 
A coordinator core exploits different types of block-level communication criticality information to fine-tune critical instructions at decode and register forward pipeline stages of their executing cores. The framework exploits the fetch criticality information at a coarser granularity by reissuing all instructions in the blocks previously fetched into the merged cores. This general framework reduces competing bottlenecks in a synergic manner and achieves scalable performance/power efficiency for sequential programs when running across a large number of cores.</description><identifier>ISSN: 1530-0897</identifier><identifier>ISBN: 142449432X</identifier><identifier>ISBN: 9781424494323</identifier><identifier>EISSN: 2378-203X</identifier><identifier>EISBN: 1424494346</identifier><identifier>EISBN: 1424494354</identifier><identifier>EISBN: 9781424494354</identifier><identifier>EISBN: 9781424494347</identifier><identifier>DOI: 10.1109/HPCA.2011.5749749</identifier><language>eng</language><publisher>IEEE</publisher><subject>Bandwidth ; Benchmark testing ; Hardware ; Microarchitecture ; Multicore processing ; Pipelines ; Registers</subject><ispartof>2011 IEEE 17th International Symposium on High Performance Computer Architecture, 2011, p.431-442</ispartof><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/5749749$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,776,780,785,786,2052,27902,54895</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/5749749$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Robatmili, B</creatorcontrib><creatorcontrib>Govindan, S</creatorcontrib><creatorcontrib>Burger, D</creatorcontrib><creatorcontrib>Keckler, S W</creatorcontrib><title>Exploiting criticality 
to reduce bottlenecks in distributed uniprocessors</title><title>2011 IEEE 17th International Symposium on High Performance Computer Architecture</title><addtitle>HPCA</addtitle><description>Composable multicore systems merge multiple independent cores for running sequential single-threaded workloads. The performance scalability of these systems, however, is limited due to partitioning overheads. This paper addresses two of the key performance scalability limitations of composable multicore systems. We present a critical path analysis revealing that communication needed for cross-core register value delivery and fetch stalls due to misspeculation are the two worst bottlenecks that prevent efficient scaling to a large number of fused cores. To alleviate these bottlenecks, this paper proposes a fully distributed framework to exploit criticality in these architectures at different granularities. A coordinator core exploits different types of block-level communication criticality information to fine-tune critical instructions at decode and register forward pipeline stages of their executing cores. The framework exploits the fetch criticality information at a coarser granularity by reissuing all instructions in the blocks previously fetched into the merged cores. 
This general framework reduces competing bottlenecks in a synergic manner and achieves scalable performance/power efficiency for sequential programs when running across a large number of cores.</description><subject>Bandwidth</subject><subject>Benchmark testing</subject><subject>Hardware</subject><subject>Microarchitecture</subject><subject>Multicore processing</subject><subject>Pipelines</subject><subject>Registers</subject><issn>1530-0897</issn><issn>2378-203X</issn><isbn>142449432X</isbn><isbn>9781424494323</isbn><isbn>1424494346</isbn><isbn>1424494354</isbn><isbn>9781424494354</isbn><isbn>9781424494347</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2011</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><sourceid>RIE</sourceid><recordid>eNpFkEtLw0AAhNcXGGt_gHjZP5C4780eS2htoaAHhd5K9iWrMQm7G7D_3oAFh4EPZmAOA8ADRhXGSD1tX5tVRRDGFZdMzb4Ad5gRxhSjTFyCglBZlwTRw9V_QQ7XoMCcohLVSt6CZUqfaJYQCnNSgN36Z-yGkEP_AU2cadou5BPMA4zOTsZBPeTcud6ZrwRDD21IOQY9ZWfh1IcxDsalNMR0D2582yW3PHMB3jfrt2Zb7l-ed81qXwbCcC65sYZpiVuMqPGirrkXBCNELW2l9NZ6zQTjZo6kMMIY53lNa8k0U0RITRfg8W83OOeOYwzfbTwdz5fQX5YbUiw</recordid><startdate>201102</startdate><enddate>201102</enddate><creator>Robatmili, B</creator><creator>Govindan, S</creator><creator>Burger, D</creator><creator>Keckler, S W</creator><general>IEEE</general><scope>6IE</scope><scope>6IL</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIL</scope></search><sort><creationdate>201102</creationdate><title>Exploiting criticality to reduce bottlenecks in distributed uniprocessors</title><author>Robatmili, B ; Govindan, S ; Burger, D ; Keckler, S 
W</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i241t-5cdc4b71a103cf6885f621003d3a77fddfb4645c10076c6ccef583874b49267b3</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2011</creationdate><topic>Bandwidth</topic><topic>Benchmark testing</topic><topic>Hardware</topic><topic>Microarchitecture</topic><topic>Multicore processing</topic><topic>Pipelines</topic><topic>Registers</topic><toplevel>online_resources</toplevel><creatorcontrib>Robatmili, B</creatorcontrib><creatorcontrib>Govindan, S</creatorcontrib><creatorcontrib>Burger, D</creatorcontrib><creatorcontrib>Keckler, S W</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Electronic Library (IEL)</collection><collection>IEEE Proceedings Order Plans (POP All) 1998-Present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Robatmili, B</au><au>Govindan, S</au><au>Burger, D</au><au>Keckler, S W</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>Exploiting criticality to reduce bottlenecks in distributed uniprocessors</atitle><btitle>2011 IEEE 17th International Symposium on High Performance Computer Architecture</btitle><stitle>HPCA</stitle><date>2011-02</date><risdate>2011</risdate><spage>431</spage><epage>442</epage><pages>431-442</pages><issn>1530-0897</issn><eissn>2378-203X</eissn><isbn>142449432X</isbn><isbn>9781424494323</isbn><eisbn>1424494346</eisbn><eisbn>1424494354</eisbn><eisbn>9781424494354</eisbn><eisbn>9781424494347</eisbn><abstract>Composable multicore systems merge multiple independent cores for running sequential single-threaded workloads. 
The performance scalability of these systems, however, is limited due to partitioning overheads. This paper addresses two of the key performance scalability limitations of composable multicore systems. We present a critical path analysis revealing that communication needed for cross-core register value delivery and fetch stalls due to misspeculation are the two worst bottlenecks that prevent efficient scaling to a large number of fused cores. To alleviate these bottlenecks, this paper proposes a fully distributed framework to exploit criticality in these architectures at different granularities. A coordinator core exploits different types of block-level communication criticality information to fine-tune critical instructions at decode and register forward pipeline stages of their executing cores. The framework exploits the fetch criticality information at a coarser granularity by reissuing all instructions in the blocks previously fetched into the merged cores. This general framework reduces competing bottlenecks in a synergic manner and achieves scalable performance/power efficiency for sequential programs when running across a large number of cores.</abstract><pub>IEEE</pub><doi>10.1109/HPCA.2011.5749749</doi><tpages>12</tpages></addata></record> |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 1530-0897 |
ispartof | 2011 IEEE 17th International Symposium on High Performance Computer Architecture, 2011, p.431-442 |
issn | 1530-0897 2378-203X |
language | eng |
recordid | cdi_ieee_primary_5749749 |
source | IEEE Electronic Library (IEL) Conference Proceedings |
subjects | Bandwidth ; Benchmark testing ; Hardware ; Microarchitecture ; Multicore processing ; Pipelines ; Registers |
title | Exploiting criticality to reduce bottlenecks in distributed uniprocessors |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-07T00%3A46%3A03IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Exploiting%20criticality%20to%20reduce%20bottlenecks%20in%20distributed%20uniprocessors&rft.btitle=2011%20IEEE%2017th%20International%20Symposium%20on%20High%20Performance%20Computer%20Architecture&rft.au=Robatmili,%20B&rft.date=2011-02&rft.spage=431&rft.epage=442&rft.pages=431-442&rft.issn=1530-0897&rft.eissn=2378-203X&rft.isbn=142449432X&rft.isbn_list=9781424494323&rft_id=info:doi/10.1109/HPCA.2011.5749749&rft_dat=%3Cieee_6IE%3E5749749%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&rft.eisbn=1424494346&rft.eisbn_list=1424494354&rft.eisbn_list=9781424494354&rft.eisbn_list=9781424494347&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=5749749&rfr_iscdi=true |