Dynamic Traffic Control of Staging Traffic on the Interconnect of the HPC Cluster System

Bibliographic details
Published in: IEEE Access 2020, Vol. 8, p. 198518-198531
Main authors: Endo, Arata; Ohtsuji, Hiroki; Hayashi, Erika; Yoshida, Eiji; Lee, Chunghan; Date, Susumu; Shimojo, Shinji
Format: Article
Language: English
Online access: Full text
Description
Summary: High-performance computing (HPC) cluster systems sometimes adopt a two-layered file system composed of local and global file systems to achieve both capacity and performance in storage. In such a cluster system, the input data of an application must be staged in from the global storage to the local storage, and the output data must be staged out from the local storage to the global storage. This staging operation must be performed quickly and efficiently to achieve high job throughput, because an inefficient staging operation delays the execution of waiting job requests. In particular, in a cluster system whose oversubscribed interconnect is shared by the storage and the computing nodes, inter-node communication traffic and staging traffic collide, which may degrade job throughput. In this research, we focus on the collision between inter-node communication traffic and staging traffic in order to improve job throughput, targeting cluster systems with an oversubscribed interconnect that carries both types of traffic. Specifically, we investigate whether dynamically controlling the traffic generated by the staging operation improves job throughput. For this investigation, we present a traffic collision avoidance method that dynamically configures a set of data paths for each type of traffic only while the staging operation is in progress. The evaluation in this article shows that the proposed method avoids traffic collisions and accelerates the staging operation by 22.0% on our cluster system. The evaluation also indicates that the overhead the proposed method imposes on the application is negligible. Furthermore, the proposed method reduces job execution time by 8.7%.
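The idea of assigning each traffic type its own set of data paths only while staging runs can be illustrated with a small toy model. The sketch below is not the authors' implementation: the class `PathAllocator`, the parameter `staging_share`, the uplink names, and the simple "reserve the first k uplinks" policy are all assumptions made for illustration. It only shows the general pattern of sharing all paths by default and switching to disjoint path sets during a staging operation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PathAllocator:
    """Toy model of per-traffic-class path sets (hypothetical, not from the article)."""

    uplinks: List[str]            # shared uplinks of an oversubscribed interconnect
    staging_share: int = 1        # assumed number of uplinks reserved for staging
    staging_active: bool = field(default=False, init=False)

    def __post_init__(self) -> None:
        # Each traffic class must keep at least one uplink while staging runs.
        if not 0 < self.staging_share < len(self.uplinks):
            raise ValueError("staging_share must leave at least one uplink per class")

    def start_staging(self) -> None:
        """Switch to disjoint path sets for the duration of a staging operation."""
        self.staging_active = True

    def finish_staging(self) -> None:
        """Return to the default configuration in which all paths are shared."""
        self.staging_active = False

    def paths_for(self, traffic_class: str) -> List[str]:
        """Return the uplinks that the given traffic class may currently use."""
        if traffic_class not in ("staging", "inter_node"):
            raise ValueError(f"unknown traffic class: {traffic_class}")
        if not self.staging_active:
            # No staging in progress: both classes share every uplink.
            return list(self.uplinks)
        reserved = self.uplinks[: self.staging_share]
        remaining = self.uplinks[self.staging_share:]
        return reserved if traffic_class == "staging" else remaining


if __name__ == "__main__":
    alloc = PathAllocator(uplinks=["up0", "up1", "up2", "up3"], staging_share=1)
    alloc.start_staging()
    print("staging:", alloc.paths_for("staging"))
    print("inter-node:", alloc.paths_for("inter_node"))
    alloc.finish_staging()
    print("after staging:", alloc.paths_for("inter_node"))
```

In this toy model the two path sets are disjoint while staging is active, so the two traffic types cannot share an uplink. How the real method selects and installs paths is described in the full text of the article, not in this record.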
ISSN:2169-3536
DOI:10.1109/ACCESS.2020.3035158