OUTRIDER: efficient memory latency tolerance with decoupled strands

We present OUTRIDER, an architecture for throughput-oriented processors that provides memory latency tolerance to improve performance on highly threaded workloads. OUTRIDER enables a single thread of execution to be presented to the architecture as multiple decoupled instruction streams that separat...
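The decoupling idea summarized above can be illustrated with a toy timing model. This is a minimal sketch under stated assumptions, not the OUTRIDER microarchitecture from the paper: the function names, the fixed memory latency, the one-load-per-cycle issue rate, and the bounded queue between the two strands are all hypothetical choices made for illustration. It compares a single in-order stream that stalls on every load with a memory-accessing strand that runs ahead of a memory-consuming strand.

```python
def coupled_cycles(n_items, mem_latency, compute_cycles):
    """Single in-order stream: every load stalls the pipeline until data returns."""
    cycle = 0
    for _ in range(n_items):
        cycle += 1                # issue the load
        cycle += mem_latency      # stall until the data arrives
        cycle += compute_cycles   # consume the loaded value
    return cycle


def decoupled_cycles(n_items, mem_latency, compute_cycles, queue_depth):
    """Two in-order strands linked by a bounded FIFO (hypothetical model).

    The memory-accessing strand issues at most one load per cycle and may run
    ahead by up to `queue_depth` outstanding items. The memory-consuming strand
    dequeues items in order and spends `compute_cycles` on each.
    """
    if queue_depth < 1:
        raise ValueError("queue_depth must be at least 1")
    issue, start, done = [], [], []
    for i in range(n_items):
        prev_issue = issue[i - 1] if i > 0 else 0
        # A queue slot frees up when the item queue_depth positions earlier is dequeued.
        slot_free = start[i - queue_depth] if i >= queue_depth else 0
        issue.append(max(prev_issue, slot_free) + 1)
        arrive = issue[i] + mem_latency
        prev_done = done[i - 1] if i > 0 else 0
        # The consumer starts once the data is present and the previous item is finished.
        start.append(max(arrive, prev_done))
        done.append(start[i] + compute_cycles)
    return done[-1] if done else 0


if __name__ == "__main__":
    args = dict(n_items=8, mem_latency=100, compute_cycles=10)
    print("coupled:  ", coupled_cycles(**args))
    print("decoupled:", decoupled_cycles(queue_depth=4, **args))
```

In this model the decoupled variant overlaps the latency of several loads, so its cycle count falls as the queue depth grows, while both variants stay in order. The numbers it prints are properties of the toy model only and say nothing about the results reported in the paper.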


Bibliographic details
Published in:	Computer architecture news 2011-06, Vol.39 (3), p.117-128
Main authors:	Crago, Neal Clayton; Patel, Sanjay Jeram
Format:	Article
Language:	English
Online access:	Full text

container_end_page	128
container_issue	3
container_start_page	117
container_title	Computer architecture news
container_volume	39
creator	Crago, Neal Clayton ; Patel, Sanjay Jeram
description	We present OUTRIDER, an architecture for throughput-oriented processors that provides memory latency tolerance to improve performance on highly threaded workloads. OUTRIDER enables a single thread of execution to be presented to the architecture as multiple decoupled instruction streams that separate memory-accessing and memory-consuming instructions. The key insight is that by decoupling the instruction streams, the processor pipeline can tolerate memory latency in a way similar to out-of-order designs while relying on a low-complexity in-order micro-architecture. Moreover, instead of adding more threads as is done in modern GPUs, OUTRIDER can tolerate memory latency with fewer threads and reduced contention for resources shared amongst threads. We demonstrate that OUTRIDER can outperform single threaded cores by 23-131% and a 4-way simultaneous multithreaded core by up to 87% on data parallel applications in a 1024-core system. Moreover, OUTRIDER achieves these performance gains without incurring the overhead of additional hardware thread contexts, which results in improved area efficiency compared to a multithreaded core.
doi_str_mv	10.1145/2024723.2000079
format	Article
fulltext	fulltext
identifier	ISSN: 0163-5964
ispartof	Computer architecture news, 2011-06, Vol.39 (3), p.117-128
issn	0163-5964
language	eng
recordid	cdi_crossref_primary_10_1145_2024723_2000079
source	ACM Digital Library
title	OUTRIDER: efficient memory latency tolerance with decoupled strands
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-08T12%3A23%3A10IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=OUTRIDER:%20efficient%20memory%20latency%20tolerance%20with%20decoupled%20strands&rft.jtitle=Computer%20architecture%20news&rft.au=Crago,%20Neal%20Clayton&rft.date=2011-06-22&rft.volume=39&rft.issue=3&rft.spage=117&rft.epage=128&rft.pages=117-128&rft.issn=0163-5964&rft_id=info:doi/10.1145/2024723.2000079&rft_dat=%3Ccrossref%3E10_1145_2024723_2000079%3C/crossref%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true