Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior


Bibliographic Details
Main authors: Kim, Yoongu; Papamichael, Michael; Mutlu, Onur; Harchol-Balter, Mor
Format: Conference Proceedings
Language: English
Subjects:
Online access: Order full text
container_end_page 76
container_start_page 65
creator Kim, Yoongu
Papamichael, Michael
Mutlu, Onur
Harchol-Balter, Mor
description In a modern chip-multiprocessor system, memory is a shared resource among multiple concurrently executing threads. The memory scheduling algorithm should resolve memory contention by arbitrating memory access in such a way that competing threads progress at a relatively fast and even pace, resulting in high system throughput and fairness. Previously proposed memory scheduling algorithms are predominantly optimized for only one of these objectives: no scheduling algorithm provides the best system throughput and best fairness at the same time. This paper presents a new memory scheduling algorithm that addresses system throughput and fairness separately with the goal of achieving the best of both. The main idea is to divide threads into two separate clusters and employ different memory request scheduling policies in each cluster. Our proposal, Thread Cluster Memory scheduling (TCM), dynamically groups threads with similar memory access behavior into either the latency-sensitive (memory-non-intensive) or the bandwidth-sensitive (memory-intensive) cluster. TCM introduces three major ideas for prioritization: 1) we prioritize the latency-sensitive cluster over the bandwidth-sensitive cluster to improve system throughput, 2) we introduce a "niceness" metric that captures a thread's propensity to interfere with other threads, and 3) we use niceness to periodically shuffle the priority order of the threads in the bandwidth-sensitive cluster to provide fair access to each thread in a way that reduces inter-thread interference. On the one hand, prioritizing memory-non-intensive threads significantly improves system throughput without degrading fairness, because such "light" threads only use a small fraction of the total available memory bandwidth. On the other hand, shuffling the priority order of memory-intensive threads improves fairness because it ensures no thread is disproportionately slowed down or starved.
We evaluate TCM on a wide variety of multiprogrammed workloads and compare its performance to four previously proposed scheduling algorithms, finding that TCM achieves both the best system throughput and fairness. Averaged over 96 workloads on a 24-core system with 4 memory channels, TCM improves system throughput and reduces maximum slowdown by 4.6%/38.6% compared to ATLAS (previous work providing the best system throughput) and 7.6%/4.6% compared to PAR-BS (previous work providing the best fairness).
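The two mechanisms in the description above (intensity-based clustering and niceness-driven shuffling) can be sketched in a few lines. This is a simplified illustration, not the paper's exact algorithm: the `Thread` fields, the use of MPKI as the intensity measure, the clustering threshold, and the `niceness` formula are all assumptions made for the example.

```python
# Illustrative sketch of TCM's two main ideas, as summarized in the
# abstract: (1) partition threads into a latency-sensitive and a
# bandwidth-sensitive cluster by memory intensity, and (2) periodically
# shuffle the priority order within the bandwidth-sensitive cluster.
# Names, thresholds, and formulas here are simplifying assumptions.
from dataclasses import dataclass

@dataclass
class Thread:
    tid: int
    mpki: float   # misses per kilo-instruction (memory intensity proxy)
    rbl: float    # row-buffer locality, 0..1
    blp: float    # bank-level parallelism, 0..1

def cluster_threads(threads, bandwidth_share=0.5):
    """Sort threads by intensity; the lightest threads whose summed
    intensity stays within (1 - bandwidth_share) of the total form the
    latency-sensitive cluster, the rest the bandwidth-sensitive one."""
    ordered = sorted(threads, key=lambda t: t.mpki)
    total = sum(t.mpki for t in threads)
    latency, bandwidth, used = [], [], 0.0
    for t in ordered:
        if used + t.mpki <= (1 - bandwidth_share) * total:
            latency.append(t)
            used += t.mpki
        else:
            bandwidth.append(t)
    return latency, bandwidth

def niceness(t):
    # A thread with high bank-level parallelism is vulnerable to
    # interference; one with high row-buffer locality tends to cause it.
    return t.blp - t.rbl

def shuffled_priorities(bandwidth_cluster, quantum):
    """Start from a niceness-sorted order (nicest first) and rotate it
    each quantum, so every thread periodically gets top priority."""
    order = sorted(bandwidth_cluster, key=niceness, reverse=True)
    k = quantum % max(len(order), 1)
    return order[k:] + order[:k]
```

With this sketch, light threads land in the latency-sensitive cluster (served first, improving throughput), while rotation among the heavy threads bounds how long any one of them stays at the bottom of the priority order (improving fairness).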
doi_str_mv 10.1109/MICRO.2010.51
format Conference Proceeding
identifier ISBN: 0769542999; 9780769542997; 1424490715; 9781424490714
publisher IEEE Computer Society, Washington, DC, USA
publicationdate 2010-12
fulltext fulltext_linktorsrc
identifier ISSN: 1072-4451
ispartof 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010, p.65-76
issn 1072-4451
language eng
recordid cdi_ieee_primary_5695526
source IEEE Electronic Library (IEL) Conference Proceedings
subjects Bandwidth
Clustering algorithms
fairness
Hardware -- Hardware validation
Hardware -- Integrated circuits -- Semiconductor memory
Instruction sets
Interference
memory access behavior
memory scheduling
niceness
Scheduling algorithm
system throughput
thread cluster
Throughput
title Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior