Impact of sharing-based thread placement on multithreaded architectures
Saved in:
Main authors:
Format: Conference paper
Language: English
Subjects:
Online access: Full text
Abstract: Multithreaded architectures context switch between instruction streams to hide memory access latency. Although this improves processor utilization, it can increase cache interference and degrade overall performance. One technique to reduce interconnect traffic is to co-locate threads that share data on the same processor. Multiple threads sharing data in the cache should reduce compulsory and invalidation misses, thereby improving execution time. To test this hypothesis, we compared a variety of thread placement algorithms via trace-driven simulation of fourteen coarse- and medium-grain parallel applications on several multithreaded architectures. Our results contradict the hypothesis. Rather than decreasing, compulsory and invalidation misses remained nearly constant across all placement algorithms and all processor configurations, even with an infinite cache. That is, sharing-based placement had no (positive) effect on execution time. Instead, load balancing was the critical factor that affected performance. Our results were due to one or both of the following reasons: (1) the sequential and uniform access of shared data by the application's threads, and (2) the insignificant number of data references that require interconnect access, relative to the total number of instructions.
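The abstract contrasts sharing-based co-location with plain load balancing. Below is a minimal Python sketch of these two families of placement policies, assuming a simple model in which each thread touches a known set of shared data blocks; the greedy affinity heuristic and all names (`round_robin`, `sharing_based`, the thread/block inputs) are illustrative assumptions, not the algorithms evaluated in the paper.

```python
# Hypothetical sketch of two thread-placement policies of the kind the
# abstract contrasts: sharing-based placement (co-locate threads with
# overlapping data) versus simple load balancing (round-robin).

def round_robin(threads, num_procs):
    """Load-balancing baseline: spread threads evenly across processors."""
    placement = {p: [] for p in range(num_procs)}
    for i, t in enumerate(threads):
        placement[i % num_procs].append(t)
    return placement

def sharing_based(threads, shared_blocks, num_procs, capacity):
    """Greedy sharing-based placement: assign each thread to the processor
    whose threads already touch the most of its shared data blocks,
    subject to a per-processor thread capacity."""
    placement = {p: [] for p in range(num_procs)}
    footprint = {p: set() for p in range(num_procs)}  # blocks touched per proc
    for t in threads:
        blocks = shared_blocks[t]
        # Consider only processors that still have room for another thread.
        candidates = [p for p in range(num_procs) if len(placement[p]) < capacity]
        # Prefer the processor with the largest overlap in data blocks.
        best = max(candidates, key=lambda p: len(footprint[p] & blocks))
        placement[best].append(t)
        footprint[best] |= blocks
    return placement

if __name__ == "__main__":
    threads = ["t0", "t1", "t2", "t3"]
    shared = {"t0": {1, 2}, "t1": {1, 2}, "t2": {7, 8}, "t3": {7, 9}}
    # Round-robin interleaves threads regardless of sharing: {0: [t0, t2], 1: [t1, t3]}
    print(round_robin(threads, num_procs=2))
    # Sharing-based groups the pairs that share blocks: {0: [t0, t1], 1: [t2, t3]}
    print(sharing_based(threads, shared, num_procs=2, capacity=2))
```

Under this toy model the sharing-based policy groups the threads with overlapping footprints; the paper's finding is that, for its workloads, such co-location did not reduce compulsory or invalidation misses, while the load balance achieved by the simpler policy did matter.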
DOI: 10.1145/191995.192027