Modeling and optimizing NUMA effects and prefetching with machine learning
Main authors: | , , , , , |
---|---|
Format: | Conference proceeding |
Language: | eng |
Subjects: |
Computing methodologies
> Machine learning
> Learning paradigms
> Supervised learning
> Supervised learning by regression
Computing methodologies
> Machine learning
> Learning paradigms
> Unsupervised learning
> Cluster analysis
Computing methodologies
> Modeling and simulation
> Model development and analysis
> Model verification and validation
|
Online access: | Order full text |
Abstract: | Both NUMA thread/data placement and hardware prefetcher configuration have significant impacts on HPC performance. Optimizing both together leads to a large and complex design space that has previously been impractical to explore at runtime.
In this work we deliver the performance benefits of optimizing both NUMA thread/data placement and prefetcher configuration at runtime through careful modeling and online profiling. To address the large design space, we propose a prediction model that reduces the amount of input information needed and the complexity of the prediction required. We do so by selecting a subset of performance counters and application configurations that provide the richest profile information as inputs, and by limiting the output predictions to a subset of configurations that cover most of the performance.
Our model is robust and can choose near-optimal NUMA+prefetcher configurations for applications from only two profile runs. We further demonstrate how to profile online with low overhead, resulting in a technique that delivers an average 1.68x performance improvement over a locality-optimized NUMA baseline with all prefetchers enabled. |
---|---|
DOI: | 10.1145/3392717.3392765 |
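
The abstract mentions limiting the model's output to a subset of configurations that cover most of the performance. The record does not say how that subset is chosen, so the following is only a minimal sketch of one plausible approach: a greedy pick over a table of measured speedups. The function names `select_configs` and `coverage`, the greedy strategy, and the data layout are all assumptions made for illustration and are not taken from the paper.

```python
def select_configs(speedups, k):
    """Greedily pick up to k configuration indices.

    speedups[a][c] is the speedup of application a under configuration c.
    Hypothetical layout and strategy; not the paper's published method.
    """
    n_configs = len(speedups[0])
    chosen = []
    # best[a] = best speedup application a gets from the configs chosen so far
    best = [0.0] * len(speedups)
    for _ in range(min(k, n_configs)):
        def total_if_added(c):
            # Summed per-application best speedup if config c joined the set
            return sum(max(row[c], b) for row, b in zip(speedups, best))

        remaining = [c for c in range(n_configs) if c not in chosen]
        candidate = max(remaining, key=total_if_added)
        chosen.append(candidate)
        best = [max(row[candidate], b) for row, b in zip(speedups, best)]
    return chosen


def coverage(speedups, chosen):
    """Fraction of the oracle (best over all configs) speedup retained."""
    achieved = sum(max(row[c] for c in chosen) for row in speedups)
    oracle = sum(max(row) for row in speedups)
    return achieved / oracle
```

With a table of per-application speedups, `coverage(speedups, select_configs(speedups, k))` shows how much of the best achievable performance survives when a predictor may only choose among `k` configurations, which is one way to decide how small the output set can be.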