LLload: Simplifying Real-Time Job Monitoring for HPC Users
Main authors: | , , , , , , , , , , , , , , , , , , |
---|---|
Format: | Article |
Language: | English |
Summary: | One of the more complex tasks for researchers using HPC systems is
performance monitoring and tuning of their applications. Developing a practice
of continuous performance improvement, both for speed-up and for efficient use
of resources, is essential to the long-term success of both the HPC
practitioner and the research project. Profiling tools provide a useful view of
the performance of an application but often have a steep learning curve and
rarely provide an easy-to-interpret view of resource utilization. Lower-level
tools such as top and htop provide a view of resource utilization for those
familiar and comfortable with Linux but present a barrier for newer HPC
practitioners. To expand the existing profiling and job monitoring options, the
MIT Lincoln Laboratory Supercomputing Center created LLload, a tool that
captures a snapshot of the resources being used by a job on a per-user basis.
LLload is built from standard HPC tools and provides an easy way for a
researcher to track resource usage of active jobs. We explain how the tool was
designed and implemented and provide insight into how it is used to aid new
researchers in developing their performance monitoring skills as well as guide
researchers in their resource requests. |
---|---|
DOI: | 10.48550/arxiv.2407.01481 |