Understanding memory access patterns using the BSC performance tools
Published in: | Parallel computing 2018-10, Vol.78, p.1-14 |
---|---|
Main authors: | , , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Summary: | • This paper describes a tighter integration of the BSC tools and the “perf” tool to analyze applications along three dimensions: source code, memory references and performance. • The integration removes the need to run “perf” externally, simplifying the collection mechanism. • It takes advantage of existing capabilities of the BSC tools, such as multiplexing performance counters. • Two well-known benchmarks are analyzed with these tools, and modifications to the benchmarks are evaluated.
The growing gap between processor and memory speeds has led to complex memory hierarchies, as processors evolve to mitigate this divergence by exploiting locality of reference. To this end, the BSC performance analysis tools have recently been extended to provide insight into application memory accesses by depicting their temporal and spatial characteristics and correlating them simultaneously with the source code and the achieved performance. These extensions rely on the Precise Event-Based Sampling (PEBS) mechanism available in recent Intel processors to capture information about the application’s memory accesses. The sampled information is later combined with the Folding technique to represent the detailed temporal evolution of the memory accesses in conjunction with the achieved performance and the corresponding source code. The reports generated by the latter tool help not only application developers but also processor architects to better understand how the application behaves and how the system performs. In this paper, we describe a tighter integration of the sampling mechanism into the monitoring package. We also demonstrate the value of the complete workflow by exploring already optimized state-of-the-art benchmarks, providing detailed insight into their memory access behavior. We have taken advantage of this insight to apply small modifications that improve the applications’ performance. |
---|---|
ISSN: | 0167-8191 1872-7336 |
DOI: | 10.1016/j.parco.2018.06.007 |