Strategies for mapping dataflow blocks to distributed hardware

Distributed processors must balance communication and concurrency. When dividing instructions among the processors, key factors are the available concurrency, criticality of dependence chains, and communication penalties. The amount of concurrency determines the importance of the other factors: if concurrency is high, wider distribution of instructions is likely to tolerate the increased operand routing latencies. If concurrency is low, mapping dependent instructions close to one another is likely to reduce communication costs that contribute to the critical path. This paper explores these tradeoffs for distributed Explicit Dataflow Graph Execution (EDGE) architectures that execute blocks of dataflow instructions atomically. A runtime block mapper assigns instructions from a single thread to distributed hardware resources (cores) based on compiler-assigned instruction identifiers. We explore two approaches: fixed strategies that map all blocks to the same number of cores, and adaptive strategies that vary the number of cores for each block. The results show that the best fixed strategy varies based on the cores' issue width. A simple adaptive strategy improves performance over the best fixed strategies for single and dual-issue cores, but its benefits decrease as the cores' issue width increases. These results show that by choosing an appropriate runtime block mapping strategy, average performance can be increased by 18%, while simultaneously reducing average operand communication by 70%, saving energy as well as improving performance. These results indicate that runtime block mapping is a promising mechanism for balancing communication and concurrency in distributed processors.
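The fixed-versus-adaptive distinction in the abstract can be made concrete with a small sketch. The paper does not publish its mapping code, so everything below is a hypothetical Python toy model: the Block structure, the modulo placement rule, the concurrency heuristic in adaptive_mapping, and the operand_hops cost proxy are all assumptions introduced only for illustration, not the authors' mechanism.

# Hypothetical sketch, not the authors' implementation: data structures and
# heuristics are invented solely to illustrate fixed vs. adaptive block mapping.

from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Block:
    # A dataflow block: compiler-assigned instruction IDs plus
    # (producer, consumer) operand dependences.
    instruction_ids: List[int]
    edges: List[Tuple[int, int]]


def fixed_mapping(block: Block, num_cores: int) -> Dict[int, int]:
    # Fixed strategy: every block is spread across the same number of cores;
    # here the core is simply the instruction ID modulo num_cores.
    return {iid: iid % num_cores for iid in block.instruction_ids}


def adaptive_mapping(block: Block, max_cores: int) -> Dict[int, int]:
    # Adaptive strategy (invented heuristic): estimate per-block concurrency
    # as instructions per dependence edge, then use fewer cores for narrow,
    # dependence-heavy blocks and more cores for wide ones.
    n_insts = len(block.instruction_ids)
    n_edges = max(len(block.edges), 1)
    cores = max(1, min(max_cores, round(n_insts / n_edges)))
    return {iid: iid % cores for iid in block.instruction_ids}


def operand_hops(block: Block, placement: Dict[int, int]) -> int:
    # Crude communication proxy: count operands whose producer and consumer
    # land on different cores.
    return sum(1 for src, dst in block.edges if placement[src] != placement[dst])


if __name__ == "__main__":
    # A narrow block with one long dependence chain: the adaptive heuristic
    # keeps it on a single core, eliminating cross-core operand traffic.
    serial = Block(instruction_ids=list(range(8)),
                   edges=[(i, i + 1) for i in range(7)])
    print("fixed, 4 cores: ", operand_hops(serial, fixed_mapping(serial, 4)))
    print("adaptive, <=4:  ", operand_hops(serial, adaptive_mapping(serial, 4)))

Under these toy assumptions the serial block routes zero operands across cores when mapped adaptively, versus seven when spread over four cores, mirroring the abstract's point that low-concurrency blocks benefit from placing dependent instructions close together.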

Bibliographic details

Main authors: Robatmili, Behnam; Coons, Katherine E.; Burger, Doug; McKinley, Kathryn S.
Format: Conference proceedings
Language: English
container_end_page 34
container_start_page 23
creator Robatmili, Behnam; Coons, Katherine E.; Burger, Doug; McKinley, Kathryn S.
date 2008-11-08
doi 10.1109/MICRO.2008.4771776
eisbn 9781424428373; 1424428378
format Conference Proceeding
isbn 9781424428366; 142442836X
publisher IEEE Computer Society, Washington, DC, USA
identifier ISSN: 1072-4451
ispartof 2008 41st IEEE/ACM International Symposium on Microarchitecture, 2008, p.23-34
issn 1072-4451
language eng
source IEEE Electronic Library (IEL) Conference Proceedings
subjects Concurrent computing
Costs
Delay
Distributed computing
General and reference
General and reference -- Cross-computing tools and techniques
General and reference -- Cross-computing tools and techniques -- Performance
Hardware
Instruction sets
Intrusion detection
Parallel processing
Routing
Runtime
title Strategies for mapping dataflow blocks to distributed hardware