Monitoring InfiniBand networks to react efficiently to congestion

Bibliographic details
Published in: IEEE Micro, 2023-03, Vol. 43 (2), p. 1-9
Main authors: Cascajo, Alberto; Gomez-Lopez, Gabriel; Escudero-Sahuquillo, Jesus; Garcia, Pedro Javier; Singh, David E.; Alfaro-Cortes, Francisco; Quiles, Francisco J.; Carretero, Jesus
Format: Article
Language: English
Description
Abstract: Current high-performance interconnection networks for HPC and data-center systems incorporate mechanisms to prevent congestion from degrading network performance. Specifically, the popular InfiniBand specification defines a mechanism to reduce the injection rate of the traffic flows contributing to congestion. However, the efficiency of this mechanism depends on the values configured for certain parameters, which may be suitable for some congestion situations but not for others. Therefore, we think that these parameters should be reconfigured dynamically, based on accurate and up-to-date information about the actual status of congestion. For that purpose, we have combined a lightweight platform monitoring tool (LIMITLESS) with the InfiniBand control software (OpenSM), so that the former provides the latter with enhanced knowledge about congestion to appropriately reconfigure the parameters driving the behavior of the congestion-control mechanism. Experiments performed in a real InfiniBand-based cluster confirm that this approach significantly reduces the number of wrong reactions to the congestion-control mechanism.
ISSN: 0272-1732
EISSN: 1937-4143
DOI: 10.1109/MM.2023.3241840