On a parallel spark workflow for frequent itemset mining based on array prefix‐tree

Extracting frequent itemsets from datasets is an important problem in data mining, for which several mining methods including FP‐Growth have been proposed. FP‐Growth is a classical frequent itemset mining method, which generates pattern databases without candidates. Many improvements have been made...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Concurrency and computation 2022-06, Vol.34 (14), p.n/a
Hauptverfasser: Niu, Xinzheng, Wu, Peng, Wu, Chase Q., Hou, Aiqin, Qian, Mideng
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Extracting frequent itemsets from datasets is an important problem in data mining, for which several mining methods including FP‐Growth have been proposed. FP‐Growth is a classical frequent itemset mining method, which generates pattern databases without candidates. Many improvements have been made in the literature due to the high time complexity and memory usage of FP‐Growth. However, most of them still suffer from performance issues on large datasets. In this paper, we design an auxiliary structure, Array Prefix‐Tree (AP‐Tree), and propose a new algorithm, Array Prefix‐Tree Growth (APT‐Growth), which is further parallelized as a Spark workflow, referred to as PAPT‐Growth. Based on a density threshold, we incorporate an adaptive algorithm selection process into PAPT‐Growth to ensure its running time performance. We conduct extensive experiments on different thresholds and multiple datasets, and experimental results show the performance superiority of PAPT‐Growth in comparison with several state‐of‐the‐art methods such as PFP, YAFIM, and DFPS. The analysis on density reveals a changing point, which justifies the necessity and validity of adaptive algorithm selection.
ISSN:1532-0626
1532-0634
DOI:10.1002/cpe.6313