Optimizing Performance on Trinity Utilizing Machine Learning, Proxy Applications and Scheduling Priorities
Main Author: | |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Summary: | The sheer number of nodes in today's supercomputers continues to increase; the
first half of Trinity alone contains more than 9400 compute nodes. Since the
speed of today's clusters is limited by the slowest nodes, it is more important
than ever to identify slow nodes, improve their performance where possible,
and ensure minimal usage of slower nodes during performance-critical runs. This
is an ongoing maintenance task that occurs on a regular basis; it is therefore
important to minimize the impact on users by assessing and
addressing slow-performing nodes and mitigating their consequences while
minimizing downtime. These issues can be solved, in large part, through a
systematic application of fast-running hardware assessment tests, the
application of machine learning, and the use of performance data to increase
the efficiency of large clusters. Proxy applications utilizing both MPI and OpenMP
were developed to produce data as a substitute for long-running applications when
evaluating node performance. Machine learning is applied to identify
underperforming nodes, and policies are being discussed to both minimize the
impact of underperforming nodes and increase the efficiency of the system. In
this paper, I describe the process used to produce fast-running
proxy tests, consider various methods to isolate the outliers, and produce
ordered lists for use in scheduling to accomplish this task. |
---|---|
DOI: | 10.48550/arxiv.2404.10617 |
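
The summary mentions isolating outlier nodes and producing ordered lists for scheduling. The sketch below illustrates one simple way this could look, not the paper's actual method: it swaps in a robust modified z-score (median and median absolute deviation) in place of the machine learning techniques the paper applies. The function name `rank_nodes`, the node names, the runtimes, and the 3.5 threshold are all assumptions made for illustration.

```python
from statistics import median


def rank_nodes(runtimes, threshold=3.5):
    """Rank nodes by proxy-test runtime and flag unusually slow ones.

    runtimes: hypothetical mapping of node name -> proxy-test runtime in seconds.
    threshold: assumed cutoff on the modified z-score (3.5 is a common rule of thumb).
    Returns (ordered, slow): nodes sorted fastest first, and the flagged slow nodes.
    """
    values = list(runtimes.values())
    med = median(values)
    # Median absolute deviation: a spread estimate that a few slow nodes cannot inflate.
    mad = median(abs(v - med) for v in values)

    def score(value):
        # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
        return 0.0 if mad == 0 else 0.6745 * (value - med) / mad

    # Only the slow tail is flagged; unusually fast nodes are not a problem here.
    slow = sorted(name for name, v in runtimes.items() if score(v) > threshold)
    # Fastest first, with the node name as a deterministic tie-breaker.
    ordered = sorted(runtimes, key=lambda name: (runtimes[name], name))
    return ordered, slow


# Made-up timings for five hypothetical nodes.
sample = {
    "nid00001": 10.0,
    "nid00002": 10.2,
    "nid00003": 9.9,
    "nid00004": 10.1,
    "nid00005": 14.0,
}
ordered, slow = rank_nodes(sample)
print("scheduling order:", ordered)
print("flagged as slow:", slow)
```

A scheduler could consume the ordered list to prefer faster nodes for performance-critical runs, while the flagged list could feed a maintenance queue. How Trinity's scheduler actually uses such lists is not stated in this record.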