PipeFL: Hardware/Software co-Design of an FPGA Accelerator for Federated Learning

Bibliographic Details
Published in: IEEE Access, 2022, Vol. 10, p. 98649-98661
Main Authors: Wang, Zixiao; Che, Biyao; Guo, Liang; Du, Yang; Chen, Ying; Zhao, Jizhuang; He, Wei
Format: Article
Language: English
Description
Summary: Federated learning solves the problems of data silos and data fragmentation while preserving privacy. However, the cryptographic algorithms used in federated learning significantly increase computational complexity, which limits the speed of model training. In this paper, we propose a hardware/software (HW/SW) co-designed field-programmable gate array (FPGA) accelerator for federated learning. First, we analyze the time consumed by each stage of federated learning and by the cryptographic algorithms involved, and identify the performance bottleneck. Second, we introduce a HW/SW co-designed architecture that speeds up encryption, decryption, and ciphertext-space computation at the same time without reconfiguring the FPGA circuit. In the HW part, we propose a Hardware-aware Montgomery Algorithm (HWMA) that exploits data parallelism and pipelining, and design an FPGA architecture that decouples data access from computation. In the SW part, we design an Operator Scheduling Engine (OSE) that flexibly resolves the target algorithm into multiple HWMA calls and completes the remaining non-computation-intensive calculations. Finally, we evaluate the accelerator on both specific algorithms and practical applications. Experimental results show that, when deployed on an Intel Stratix 10 FPGA, our accelerator increases the throughput of 2048-bit modular multiplication, modular exponentiation, and the Paillier algorithm to more than 3x that of the CPU. When integrated into an industrial-grade open-source federated learning framework, it shortens the end-to-end training time of linear regression and logistic regression by 2.28x and 3.30x, respectively, which is more than 2x faster than the best reported results for FPGA accelerators.
ISSN:2169-3536
DOI:10.1109/ACCESS.2022.3206785
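
The summary describes resolving modular exponentiation and the Paillier algorithm into repeated Montgomery multiplication calls. The following Python sketch illustrates that general decomposition in software only. It is not the paper's HWMA or OSE: the function names (`mont_setup`, `mont_mul`, `mont_pow`, `paillier_encrypt`, `paillier_decrypt`), the choice g = n + 1, and the toy primes are all assumptions made for illustration, and nothing here models the data parallelism or pipelining of the hardware design.

```python
from math import gcd


def mont_setup(n):
    """Precompute Montgomery constants for an odd modulus n, with R = 2**bits > n."""
    if n % 2 == 0:
        raise ValueError("Montgomery reduction needs an odd modulus")
    bits = n.bit_length()
    r = 1 << bits
    n_prime = (-pow(n, -1, r)) % r  # n * n_prime == -1 (mod R); needs Python 3.8+
    r2 = (r * r) % n                # used to convert values into Montgomery form
    return bits, n_prime, r2


def mont_mul(a, b, n, bits, n_prime):
    """Return a * b * R^-1 mod n for 0 <= a, b < n, using only shifts and masks for R."""
    mask = (1 << bits) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask
    u = (t + m * n) >> bits         # exact division by R
    return u - n if u >= n else u


def mont_pow(base, exp, n):
    """Square-and-multiply exponentiation expressed as a sequence of mont_mul calls."""
    bits, n_prime, r2 = mont_setup(n)
    x = mont_mul(base % n, r2, n, bits, n_prime)  # base in Montgomery form
    acc = mont_mul(1, r2, n, bits, n_prime)       # 1 in Montgomery form
    for bit in bin(exp)[2:]:
        acc = mont_mul(acc, acc, n, bits, n_prime)
        if bit == "1":
            acc = mont_mul(acc, x, n, bits, n_prime)
    return mont_mul(acc, 1, n, bits, n_prime)     # leave Montgomery form


def paillier_encrypt(m, r, n):
    """Paillier encryption with g = n + 1, so g^m mod n^2 simplifies to 1 + m*n."""
    n2 = n * n
    return ((1 + m * n) % n2) * mont_pow(r, n, n2) % n2


def paillier_decrypt(c, p, q):
    """Paillier decryption for n = p*q, assuming g = n + 1."""
    n = p * q
    n2 = n * n
    lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)
    x = mont_pow(c, lam, n2)
    return ((x - 1) // n) * mu % n


if __name__ == "__main__":
    p, q = 47, 59                   # toy primes, far too small for real use
    n = p * q
    c1 = paillier_encrypt(12, 17, n)
    c2 = paillier_encrypt(30, 23, n)
    c_sum = c1 * c2 % (n * n)       # ciphertext-space addition
    print(paillier_decrypt(c_sum, p, q))  # prints 42 (12 + 30)
```

In this sketch almost all of the arithmetic cost sits inside `mont_mul`, which is why offloading that single operator can speed up encryption, decryption, and ciphertext-space computation together.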