On a parallel Spark workflow for frequent itemset mining based on array prefix‐tree
Published in: | Concurrency and computation 2022-06, Vol.34 (14), p.n/a |
---|---|
Main authors: | , , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Abstract: | Extracting frequent itemsets from datasets is an important problem in data mining, for which several mining methods, including FP‐Growth, have been proposed. FP‐Growth is a classical frequent itemset mining method that generates pattern databases without candidates. Many improvements have been proposed in the literature to address the high time complexity and memory usage of FP‐Growth. However, most of them still suffer from performance issues on large datasets. In this paper, we design an auxiliary structure, Array Prefix‐Tree (AP‐Tree), and propose a new algorithm, Array Prefix‐Tree Growth (APT‐Growth), which is further parallelized as a Spark workflow, referred to as PAPT‐Growth. Based on a density threshold, we incorporate an adaptive algorithm selection process into PAPT‐Growth to ensure its running time performance. We conduct extensive experiments with different thresholds on multiple datasets, and the experimental results show the performance superiority of PAPT‐Growth in comparison with several state‐of‐the‐art methods such as PFP, YAFIM, and DFPS. The analysis of density reveals a change point, which justifies the necessity and validity of adaptive algorithm selection. |
---|---|
ISSN: | 1532-0626 1532-0634 |
DOI: | 10.1002/cpe.6313 |
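
The abstract mentions choosing a mining algorithm adaptively based on a density threshold, but this record does not give the definition of density, the threshold value, or which algorithm is used on each side of it. The following is only a minimal sketch of the general idea under stated assumptions: density is taken here as the average transaction length divided by the number of distinct items, the threshold 0.5 is arbitrary, and the function names and strategy labels are hypothetical rather than taken from the paper.

```python
from typing import Iterable, List, Set


def dataset_density(transactions: Iterable[Iterable[str]]) -> float:
    """Assumed density measure: mean transaction length / number of distinct items.

    This definition is an illustration only; the paper may define density differently.
    """
    tx: List[Set[str]] = [set(t) for t in transactions]
    if not tx:
        return 0.0
    distinct_items = set().union(*tx)
    if not distinct_items:
        return 0.0
    mean_length = sum(len(t) for t in tx) / len(tx)
    return mean_length / len(distinct_items)


def select_strategy(transactions: Iterable[Iterable[str]],
                    density_threshold: float = 0.5) -> str:
    """Pick a (hypothetical) strategy label by comparing density to a threshold."""
    if dataset_density(transactions) >= density_threshold:
        return "dense-strategy"   # placeholder label, not an algorithm from the paper
    return "sparse-strategy"      # placeholder label, not an algorithm from the paper


if __name__ == "__main__":
    dense = [["a", "b"], ["a", "b"]]
    sparse = [["a"], ["b"], ["c"], ["d"]]
    print(select_strategy(dense), select_strategy(sparse))
```

In a real workflow the density would be computed once over the input before the mining stage starts, so that the selection cost stays small relative to the mining itself.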