An Example of Parallel Merkle Tree Traversal: Post-Quantum Leighton-Micali Signature on the GPU

The hash-based signature (HBS) is the most conservative and time-consuming among many post-quantum cryptography (PQC) algorithms. Two HBSs, LMS and XMSS, are the only PQC algorithms standardised by the National Institute of Standards and Technology (NIST) now. Existing HBSs are designed based on ser...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:ACM transactions on architecture and code optimization 2024-09, Vol.21 (3), p.1-25, Article 44
Hauptverfasser: Wang, Ziheng, Dong, Xiaoshe, Kang, Yan, Chen, Heng, Wang, Qiang
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:The hash-based signature (HBS) is the most conservative and time-consuming among many post-quantum cryptography (PQC) algorithms. Two HBSs, LMS and XMSS, are the only PQC algorithms standardised by the National Institute of Standards and Technology (NIST) now. Existing HBSs are designed based on serial Merkle tree traversal, which is not conducive to taking full advantage of the computing power of parallel architectures such as CPUs and GPUs. We propose a parallel Merkle tree traversal (PMTT), which is tested by implementing LMS on the GPU. This is the first work accelerating LMS on the GPU, which performs well even with over 10,000 cores. Considering different scenarios of algorithmic parallelism and data parallelism, we implement corresponding variants for PMTT. The design of PMTT for algorithmic parallelism mainly considers the execution efficiency of a single task, while that for data parallelism starts with the full utilisation of GPU performance. In addition, we are the first to design a CPU-GPU collaborative processing solution for traversal algorithms to reduce the communication overhead between CPU and GPU. For algorithmic parallelism, our implementation is still 4.48× faster than the ideal time of the state-of-the-art traversal algorithm. For data parallelism, when the number of cores increases from 1 to 8,192, the parallel efficiency is 78.39%. In comparison, our LMS implementation outperforms most existing LMS and XMSS implementations.
ISSN:1544-3566
1544-3973
DOI:10.1145/3659209