OPTWEB: A Lightweight Fully Connected Inter-FPGA Network for Efficient Collectives

Modern FPGA accelerators can be equipped with many high-bandwidth network I/Os, e.g., 64 x 50 Gbps, enabled by onboard optics or co-packaged optics. Some dozens of tightly coupled FPGA accelerators form an emerging computing platform for distributed data processing. However, a conventional indirect...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	IEEE transactions on computers 2021-06, Vol.70 (6), p.849-862
Hauptverfasser:	Mizutani, Kenji, Yamaguchi, Hiroshi, Urino, Yutaka, Koibuchi, Michihiro
Format:	Artikel
Sprache:	eng
Schlagworte:	Accelerators Bandwidth circuit-switching networks collectives Data processing Ethernet fiber optics Field programmable gate arrays FPGAs Interconnection architecture Interconnections Lightweight Logic Network latency Network topologies Network topology Optical switches Optics Switching circuits Synchronism Toruses
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	Modern FPGA accelerators can be equipped with many high-bandwidth network I/Os, e.g., 64 x 50 Gbps, enabled by onboard optics or co-packaged optics. Some dozens of tightly coupled FPGA accelerators form an emerging computing platform for distributed data processing. However, a conventional indirect packet network using Ethernet's Intellectual Properties imposes an unacceptably large amount of the logic for handling such high-bandwidth interconnects on an FPGA. Besides the indirect network, another approach builds a direct packet network. Existing direct inter-FPGA networks have a low-radix network topology, e.g., 2-D torus. However, the low-radix network has the disadvantage of a large diameter and large average shortest path length that increases the latency of collectives. To mitigate both problems, we propose a lightweight, fully connected inter-FPGA network called OPTWEB for efficient collectives. Since all end-to-end separate communication paths are statically established using onboard optics, raw block data can be transferred with simple link-level synchronization. Once each source FPGA assigns a communication stream to a path by its internal switch logic between memory-mapped and stream interfaces for remote direct memory access (RDMA), a one-hop transfer is provided. Since each FPGA performs input/output of the remote memory access between all FPGAs simultaneously, multiple RDMAs efficiently form collectives. The OPTWEB network provides 0.71-μsec start-up latency of collectives among multiple Intel Stratix 10 MX FPGA cards with onboard optics. The OPTWEB network consumes 31.4 and 57.7 percent of adaptive logic modules for aggregate 400-Gbps and 800-Gbps interconnects on a custom Stratix 10 MX 2100 FPGA, respectively. The OPTWEB network reduces by 40 percent the cost compared to a conventional packet network.
ISSN:	0018-9340 1557-9956
DOI:	10.1109/TC.2021.3068715