Diagnosing End-Host Network Bottlenecks in RDMA Servers
Published in: | IEEE/ACM Transactions on Networking 2024-10, Vol.32 (5), p.4302-4316 |
---|---|
Main authors: | , , , , , , , |
Format: | Article |
Language: | English |
Abstract: | In RDMA (Remote Direct Memory Access) networks, end-host networks, including intra-host networks and RNICs (RDMA NICs), were considered robust and have received little attention. However, as the RNIC line rate rapidly increases to multi-hundred gigabits, the intra-host network becomes a potential performance bottleneck for network applications. Intra-host network bottlenecks can result in degraded intra-host bandwidth and increased intra-host latency. In addition, RNIC network problems can result in connection failures and packet drops. Host network problems can severely degrade network performance, yet when they occur they can hardly be noticed because no monitoring system covers them. Furthermore, existing diagnostic mechanisms cannot efficiently diagnose host network problems. In this paper, we analyze the symptoms of host network problems based on our long-term troubleshooting experience and propose Hostping, the first monitoring and diagnostic system dedicated to host networks. The core idea of Hostping is to conduct 1) loopback tests between RNICs and endpoints within the host to measure intra-host latency and bandwidth, and 2) mutual probing between RNICs on a host to measure RNIC connectivity. We have deployed Hostping on thousands of servers in our distributed machine learning system. Hostping not only detects and diagnoses known host network problems within minutes, but also reveals eight problems we had not noticed before. |
---|---|
ISSN: | 1063-6692, 1558-2566 |
DOI: | 10.1109/TNET.2024.3416419 |