PageNUCA: Selected policies for page-grain locality management in large shared chip-multiprocessor caches
As the last-level on-chip caches in chip-multiprocessors increase in size, the physical locality of on-chip data becomes important for delivering high performance. The non-uniform access latency seen by a core to different independent banks of a large cache spread over the chip necessitates active mechanisms for improving data locality. The central proposal of this paper is a fully hardwired coarse-grain data migration mechanism that dynamically monitors the access patterns of the cores at the granularity of a page to reduce the book-keeping overhead and decides when and where to migrate an entire page of data to amortize the performance overhead. The page-grain migration mechanism is compared against two variants of previously proposed cache block-grain dynamic migration mechanisms and two OS-assisted static locality management mechanisms. Our detailed execution-driven simulation of an eight-core chip-multiprocessor with a shared 16 MB L2 cache employing a bidirectional ring to connect the cores and the L2 cache banks shows that hardwired dynamic page migration, while using only 4.8% of extra storage out of the total L2 cache and book-keeping budget, delivers the best performance and energy-efficiency across a set of shared memory parallel applications selected from the SPLASH-2, SPEC OMP, DARPA DIS, and FFTW suites and multiprogrammed workloads prepared out of the SPEC 2000 and BioBench suites. It reduces execution time by 18.7% and 12.6% on average (geometric mean) respectively for the shared memory applications and the multiprogrammed workloads compared to a baseline architecture that distributes the pages round-robin across the L2 cache banks.
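The abstract's central mechanism, hardware monitoring of per-page access patterns followed by migration of a whole page toward its dominant accessor, can be sketched in a few lines. This is a minimal illustrative model under assumptions, not the paper's actual hardware design: the table size, the migration threshold, and the `bank_nearest_to` mapping are hypothetical values and helpers invented for the example.

```c
/* Minimal sketch of a page-grain migration policy, assuming:
 * - NUM_CORES cores sharing a banked L2 cache,
 * - a per-page table of per-core access counters,
 * - migration once one core's accesses dominate past a threshold.
 * Sizes and the threshold are illustrative, not the paper's values. */
#include <stdint.h>
#include <string.h>

#define NUM_CORES      8
#define NUM_PAGES      4096           /* tracked pages (illustrative) */
#define MIGRATE_THRESH 64             /* accesses before considering migration */

typedef struct {
    uint16_t access[NUM_CORES];       /* per-core access counts for this page */
    uint8_t  home_bank;               /* bank currently holding the page */
} page_entry_t;

static page_entry_t table[NUM_PAGES];

/* Hypothetical mapping from a core to its closest L2 bank on the ring. */
static uint8_t bank_nearest_to(int core) { return (uint8_t)core; }

/* Called on every L2 access; returns the (possibly new) home bank. */
uint8_t on_l2_access(int page, int core)
{
    page_entry_t *e = &table[page];
    e->access[core]++;

    /* Find the core that dominates accesses to this page. */
    int best = 0;
    for (int c = 1; c < NUM_CORES; c++)
        if (e->access[c] > e->access[best])
            best = c;

    /* Migrate the whole page toward the dominant accessor once it has
     * accumulated enough accesses to amortize the migration cost. */
    if (e->access[best] >= MIGRATE_THRESH &&
        bank_nearest_to(best) != e->home_bank) {
        e->home_bank = bank_nearest_to(best);
        memset(e->access, 0, sizeof e->access);   /* restart monitoring */
    }
    return e->home_bank;
}
```

Migrating at page rather than cache-block granularity keeps the monitoring table small (one entry per page instead of per block), which is how the paper's mechanism amortizes book-keeping overhead; the threshold-and-reset structure above is one simple way to express that trade-off.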
Saved in:
1. Author: | Chaudhuri, M. |
---|---|
Format: | Conference Proceeding |
Language: | eng |
Subjects: | Cache storage; Computer science; Data engineering; Delay; Energy storage; Engineering management; Proposals; SDRAM; Switches; Technology management |
Online Access: | Order full text |
container_end_page | 238 |
---|---|
container_issue | |
container_start_page | 227 |
container_title | 2009 IEEE 15th International Symposium on High Performance Computer Architecture |
container_volume | |
creator | Chaudhuri, M. |
doi_str_mv | 10.1109/HPCA.2009.4798258 |
format | Conference Proceeding |
fullrecord | <record><control><sourceid>ieee_6IE</sourceid><recordid>TN_cdi_ieee_primary_4798258</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>4798258</ieee_id><sourcerecordid>4798258</sourcerecordid><originalsourceid>FETCH-LOGICAL-i175t-6bfc591020e19c39d783a368a7214c56452a9452b990a93da74cac036728bc723</originalsourceid><addsrcrecordid>eNotkE1LAzEQhoMfYK39AeIlfyA1n5vEW1nUCkULWvBWpum0jex2l2Q99N8b0TnMwPMyz-El5FbwqRDc38-X9WwqOfdTbb2Txp2RkVTWMcnV5zm5FlpqLb2S6oKMhFGcceftFZnk_MXLaKM0FyMSl7DH11U9e6Dv2GAYcEv7rokhYqa7LtG-5GyfIB5p0wVo4nCiLRwLbfE40F8MaY80HyCV33CIPWu_myH2qQuYc1EECAfMN-RyB03Gyf8dk9XT40c9Z4u355d6tmBRWDOwarMLxgsuOQoflN9ap0BVDqwUOphKGwm-rI33HLzagtXFz1VlpdsEK9WY3P15IyKu-xRbSKf1f0vqB-GeWZM</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>PageNUCA: Selected policies for page-grain locality management in large shared chip-multiprocessor caches</title><source>IEEE Electronic Library (IEL) Conference Proceedings</source><creator>Chaudhuri, M.</creator><creatorcontrib>Chaudhuri, M.</creatorcontrib><description>As the last-level on-chip caches in chip-multiprocessors increase in size, the physical locality of on-chip data becomes important for delivering high performance. The non-uniform access latency seen by a core to different independent banks of a large cache spread over the chip necessitates active mechanisms for improving data locality. The central proposal of this paper is a fully hardwired coarse-grain data migration mechanism that dynamically monitors the access patterns of the cores at the granularity of a page to reduce the book-keeping overhead and decides when and where to migrate an entire page of data to amortize the performance overhead. The page-grain migration mechanism is compared against two variants of previously proposed cache block-grain dynamic migration mechanisms and two OS-assisted static locality management mechanisms. Our detailed execution-driven simulation of an eight-core chip-multiprocessor with a shared 16 MB L2 cache employing a bidirectional ring to connect the cores and the L2 cache banks shows that hardwired dynamic page migration, while using only 4.8% of extra storage out of the total L2 cache and book-keeping budget, delivers the best performance and energy-efficiency across a set of shared memory parallel applications selected from the SPLASH-2, SPEC OMP, DARPA DIS, and FFTW suites and multiprogrammed workloads prepared out of the SPEC 2000 and BioBench suites. 
It reduces execution time by 18.7% and 12.6% on average (geometric mean) respectively for the shared memory applications and the multiprogrammed workloads compared to a baseline architecture that distributes the pages round-robin across the L2 cache banks.</description><identifier>ISSN: 1530-0897</identifier><identifier>ISBN: 1424429323</identifier><identifier>ISBN: 9781424429325</identifier><identifier>EISSN: 2378-203X</identifier><identifier>DOI: 10.1109/HPCA.2009.4798258</identifier><language>eng</language><publisher>IEEE</publisher><subject>Cache storage ; Computer science ; Data engineering ; Delay ; Energy storage ; Engineering management ; Proposals ; SDRAM ; Switches ; Technology management</subject><ispartof>2009 IEEE 15th International Symposium on High Performance Computer Architecture, 2009, p.227-238</ispartof><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/4798258$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,776,780,785,786,2051,27904,54898</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/4798258$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Chaudhuri, M.</creatorcontrib><title>PageNUCA: Selected policies for page-grain locality management in large shared chip-multiprocessor caches</title><title>2009 IEEE 15th International Symposium on High Performance Computer Architecture</title><addtitle>HPCA</addtitle><description>As the last-level on-chip caches in chip-multiprocessors increase in size, the physical locality of on-chip data becomes important for delivering high performance. The non-uniform access latency seen by a core to different independent banks of a large cache spread over the chip necessitates active mechanisms for improving data locality. The central proposal of this paper is a fully hardwired coarse-grain data migration mechanism that dynamically monitors the access patterns of the cores at the granularity of a page to reduce the book-keeping overhead and decides when and where to migrate an entire page of data to amortize the performance overhead. The page-grain migration mechanism is compared against two variants of previously proposed cache block-grain dynamic migration mechanisms and two OS-assisted static locality management mechanisms. Our detailed execution-driven simulation of an eight-core chip-multiprocessor with a shared 16 MB L2 cache employing a bidirectional ring to connect the cores and the L2 cache banks shows that hardwired dynamic page migration, while using only 4.8% of extra storage out of the total L2 cache and book-keeping budget, delivers the best performance and energy-efficiency across a set of shared memory parallel applications selected from the SPLASH-2, SPEC OMP, DARPA DIS, and FFTW suites and multiprogrammed workloads prepared out of the SPEC 2000 and BioBench suites. 
It reduces execution time by 18.7% and 12.6% on average (geometric mean) respectively for the shared memory applications and the multiprogrammed workloads compared to a baseline architecture that distributes the pages round-robin across the L2 cache banks.</description><subject>Cache storage</subject><subject>Computer science</subject><subject>Data engineering</subject><subject>Delay</subject><subject>Energy storage</subject><subject>Engineering management</subject><subject>Proposals</subject><subject>SDRAM</subject><subject>Switches</subject><subject>Technology management</subject><issn>1530-0897</issn><issn>2378-203X</issn><isbn>1424429323</isbn><isbn>9781424429325</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2009</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><sourceid>RIE</sourceid><recordid>eNotkE1LAzEQhoMfYK39AeIlfyA1n5vEW1nUCkULWvBWpum0jex2l2Q99N8b0TnMwPMyz-El5FbwqRDc38-X9WwqOfdTbb2Txp2RkVTWMcnV5zm5FlpqLb2S6oKMhFGcceftFZnk_MXLaKM0FyMSl7DH11U9e6Dv2GAYcEv7rokhYqa7LtG-5GyfIB5p0wVo4nCiLRwLbfE40F8MaY80HyCV33CIPWu_myH2qQuYc1EECAfMN-RyB03Gyf8dk9XT40c9Z4u355d6tmBRWDOwarMLxgsuOQoflN9ap0BVDqwUOphKGwm-rI33HLzagtXFz1VlpdsEK9WY3P15IyKu-xRbSKf1f0vqB-GeWZM</recordid><startdate>200902</startdate><enddate>200902</enddate><creator>Chaudhuri, M.</creator><general>IEEE</general><scope>6IE</scope><scope>6IL</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIL</scope></search><sort><creationdate>200902</creationdate><title>PageNUCA: Selected policies for page-grain locality management in large shared chip-multiprocessor caches</title><author>Chaudhuri, M.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i175t-6bfc591020e19c39d783a368a7214c56452a9452b990a93da74cac036728bc723</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2009</creationdate><topic>Cache storage</topic><topic>Computer science</topic><topic>Data engineering</topic><topic>Delay</topic><topic>Energy storage</topic><topic>Engineering management</topic><topic>Proposals</topic><topic>SDRAM</topic><topic>Switches</topic><topic>Technology management</topic><toplevel>online_resources</toplevel><creatorcontrib>Chaudhuri, M.</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Electronic Library (IEL)</collection><collection>IEEE Proceedings Order Plans (POP All) 1998-Present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Chaudhuri, M.</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>PageNUCA: Selected policies for page-grain locality management in large shared chip-multiprocessor caches</atitle><btitle>2009 IEEE 15th International Symposium on High Performance Computer Architecture</btitle><stitle>HPCA</stitle><date>2009-02</date><risdate>2009</risdate><spage>227</spage><epage>238</epage><pages>227-238</pages><issn>1530-0897</issn><eissn>2378-203X</eissn><isbn>1424429323</isbn><isbn>9781424429325</isbn><abstract>As the last-level on-chip caches in chip-multiprocessors increase in size, the physical locality of on-chip data becomes important for delivering high performance. 
The non-uniform access latency seen by a core to different independent banks of a large cache spread over the chip necessitates active mechanisms for improving data locality. The central proposal of this paper is a fully hardwired coarse-grain data migration mechanism that dynamically monitors the access patterns of the cores at the granularity of a page to reduce the book-keeping overhead and decides when and where to migrate an entire page of data to amortize the performance overhead. The page-grain migration mechanism is compared against two variants of previously proposed cache block-grain dynamic migration mechanisms and two OS-assisted static locality management mechanisms. Our detailed execution-driven simulation of an eight-core chip-multiprocessor with a shared 16 MB L2 cache employing a bidirectional ring to connect the cores and the L2 cache banks shows that hardwired dynamic page migration, while using only 4.8% of extra storage out of the total L2 cache and book-keeping budget, delivers the best performance and energy-efficiency across a set of shared memory parallel applications selected from the SPLASH-2, SPEC OMP, DARPA DIS, and FFTW suites and multiprogrammed workloads prepared out of the SPEC 2000 and BioBench suites. It reduces execution time by 18.7% and 12.6% on average (geometric mean) respectively for the shared memory applications and the multiprogrammed workloads compared to a baseline architecture that distributes the pages round-robin across the L2 cache banks.</abstract><pub>IEEE</pub><doi>10.1109/HPCA.2009.4798258</doi><tpages>12</tpages></addata></record> |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 1530-0897; EISSN: 2378-203X; ISBN: 1424429323; ISBN: 9781424429325 |
ispartof | 2009 IEEE 15th International Symposium on High Performance Computer Architecture, 2009, p.227-238 |
issn | 1530-0897 2378-203X |
language | eng |
recordid | cdi_ieee_primary_4798258 |
source | IEEE Electronic Library (IEL) Conference Proceedings |
subjects | Cache storage; Computer science; Data engineering; Delay; Energy storage; Engineering management; Proposals; SDRAM; Switches; Technology management |
title | PageNUCA: Selected policies for page-grain locality management in large shared chip-multiprocessor caches |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-26T23%3A21%3A17IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=PageNUCA:%20Selected%20policies%20for%20page-grain%20locality%20management%20in%20large%20shared%20chip-multiprocessor%20caches&rft.btitle=2009%20IEEE%2015th%20International%20Symposium%20on%20High%20Performance%20Computer%20Architecture&rft.au=Chaudhuri,%20M.&rft.date=2009-02&rft.spage=227&rft.epage=238&rft.pages=227-238&rft.issn=1530-0897&rft.eissn=2378-203X&rft.isbn=1424429323&rft.isbn_list=9781424429325&rft_id=info:doi/10.1109/HPCA.2009.4798258&rft_dat=%3Cieee_6IE%3E4798258%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=4798258&rfr_iscdi=true |