Collie: Finding Performance Anomalies in RDMA Subsystems
Main Authors: | , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Summary: | High-speed RDMA networks are being rapidly adopted in industry for
their low latency and reduced CPU overhead. To verify that RDMA can be used in
production, system administrators need to understand the set of application
workloads that can trigger abnormal performance behaviors (e.g.,
unexpectedly low throughput, PFC pause frame storms). We design and implement
Collie, a tool that lets users systematically uncover performance anomalies in
RDMA subsystems without access to the hardware's internal design. Instead
of testing each hardware component (e.g., NIC, memory, PCIe) individually, Collie
takes a holistic approach, constructing a comprehensive search space of
application workloads. Collie then uses simulated annealing to drive RDMA-related
performance and diagnostic counters to extreme value regions, thereby finding
workloads that trigger performance anomalies. We evaluate Collie on combinations of
various RDMA NICs, CPUs, and other hardware components. Collie found 15 new
performance anomalies, all of which have been acknowledged by the hardware
vendors; 7 have already been fixed since we reported them. We also present our
experience using Collie to avoid performance anomalies in an RDMA RPC
library and an RDMA distributed machine learning framework. |
---|---|
DOI: | 10.48550/arxiv.2304.11467 |
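
The summary describes a simulated-annealing search that pushes counters toward extreme values. The following is a minimal sketch of that general technique only, not Collie's implementation: the search-space dimensions, the `synthetic_counter` function, and all parameter names are hypothetical and stand in for running a real workload and reading a real hardware counter.

```python
import math
import random

# Hypothetical search space: these dimensions and values are illustrative
# assumptions, not Collie's actual workload parameters.
SEARCH_SPACE = {
    "message_size": [64, 512, 4096, 65536, 1048576],
    "num_qps": [1, 8, 64, 512],
    "opcode": ["SEND", "WRITE", "READ"],
}


def synthetic_counter(workload):
    """Stand-in for reading a diagnostic counter after running a workload.

    This is a made-up function so the sketch runs without any hardware.
    """
    score = math.log2(workload["num_qps"]) * 2.0
    score += 5.0 if workload["message_size"] <= 512 else 0.0
    score += 3.0 if workload["opcode"] == "READ" else 0.0
    return score


def neighbor(workload, rng):
    """Return a copy of the workload with one dimension re-sampled."""
    candidate = dict(workload)
    key = rng.choice(sorted(SEARCH_SPACE))
    candidate[key] = rng.choice(SEARCH_SPACE[key])
    return candidate


def anneal(measure, steps=2000, t_start=5.0, t_end=0.01, seed=0):
    """Maximize measure(workload) over SEARCH_SPACE with simulated annealing."""
    rng = random.Random(seed)
    current = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
    current_score = measure(current)
    best, best_score = current, current_score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        candidate = neighbor(current, rng)
        candidate_score = measure(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept regressions with probability
        # exp(delta / T), which shrinks as the temperature drops.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
        if current_score > best_score:
            best, best_score = current, current_score
    return best, best_score
```

Calling `anneal(synthetic_counter)` returns the highest-scoring workload seen and its score. In a real setting, `measure` would presumably launch the workload and sample the counter of interest, which makes each step far more expensive than in this toy version.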