GraphCC: A practical graph learning-based approach to Congestion Control in datacenters
Published in: | Computer networks (Amsterdam, Netherlands : 1999), 2025-02, Vol.257, p.110981, Article 110981 |
---|---|
Main authors: | , , , , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Summary: | Congestion Control (CC) plays a fundamental role in optimizing traffic in Datacenter Networks (DCNs). Currently, DCNs implement two main CC protocols: DCTCP and DCQCN. Both protocols are based on Explicit Congestion Notification (ECN), where switches mark packets when they detect congestion. Nowadays, network experts carefully set ECN parameters to optimize the average network performance. However, today’s DCNs experience rapid and abrupt changes that severely affect the network state (e.g., dynamic workloads, incasts), which leads to under-utilization and sub-optimal performance. In this paper, we present GraphCC, a framework for in-network CC optimization. GraphCC relies on Multi-agent Reinforcement Learning (MARL) and Graph Neural Networks (GNN), and is compatible with widely deployed ECN-based CC protocols. The proposed solution deploys distributed agents on switches that communicate with their neighbors to cooperate and optimize the global ECN configuration. In our evaluation, we test GraphCC with three real-world traffic workloads, focusing on its capability to accommodate scenarios unseen during training (e.g., traffic changes, failures). We compare GraphCC with a state-of-the-art MARL solution for ECN tuning, and observe that our method outperforms the state-of-the-art baseline in all evaluation scenarios, with improvements of up to 20% in average Flow Completion Time, similar mean throughput (within 1%), and significant reductions in buffer occupancy (38.0–85.7%). |
---|---|
ISSN: | 1389-1286 |
DOI: | 10.1016/j.comnet.2024.110981 |
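
The summary refers to switches marking packets when they detect congestion and to ECN parameters that experts tune by hand. As a rough illustration of what such parameters can look like, the sketch below assumes the RED-style marking scheme commonly configured for ECN on switches: a lower queue threshold, an upper queue threshold, and a maximum marking probability. The names `ecn_mark_probability`, `kmin`, `kmax`, and `pmax` are hypothetical and are not taken from the GraphCC paper; the record does not say which parameters GraphCC actually adjusts.

```python
def ecn_mark_probability(queue_len, kmin, kmax, pmax):
    """Return the probability of ECN-marking a packet (RED-style sketch).

    queue_len, kmin and kmax share one unit (e.g. KB); pmax is in [0, 1].
    All names are illustrative assumptions, not the paper's notation.
    """
    if not (0 <= kmin <= kmax) or not (0.0 <= pmax <= 1.0):
        raise ValueError("expected 0 <= kmin <= kmax and 0 <= pmax <= 1")
    if queue_len <= kmin:
        # Queue at or below the lower threshold: never mark.
        return 0.0
    if queue_len > kmax:
        # Queue above the upper threshold: always mark.
        return 1.0
    # Between the thresholds: probability rises linearly up to pmax.
    return pmax * (queue_len - kmin) / (kmax - kmin)


if __name__ == "__main__":
    # Hypothetical thresholds in KB; real deployments choose their own values.
    for q in (10, 50, 80, 100):
        print(q, ecn_mark_probability(q, kmin=20, kmax=80, pmax=0.2))
```

Setting `kmin` equal to `kmax` collapses the ramp into a single step threshold, which is closer to how DCTCP-style marking is usually described. A tuning agent of the kind the summary mentions would, under this assumption, adjust values such as these per switch rather than leaving them fixed.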