Statistically Significant Pattern Mining With Ordinal Utility


Bibliographic details
Published in: IEEE Transactions on Knowledge and Data Engineering, 2023-09, Vol. 35 (9), pp. 8770-8783
Main authors: Tran, Thien Q.; Fukuchi, Kazuto; Akimoto, Youhei; Sakuma, Jun
Format: Article
Language: English
Description
Abstract: Statistically significant pattern mining (SSPM), which evaluates each pattern via a hypothesis test, is an essential and challenging data mining task for knowledge discovery. We introduce a preference relation between patterns and aim to discover the most preferred patterns under the constraint of statistical significance, a setting that has never been considered in existing SSPM problems. We propose an iterative multiple testing procedure that alternately rejects a hypothesis and safely ignores the hypotheses that are less useful than the rejected one. By filtering out patterns with low utility, we avoid spending the significance budget on rejecting useless (uninteresting) patterns and can focus it on more useful patterns, leading to more useful discoveries. We show that the proposed method can control the familywise error rate (FWER) under certain assumptions, which can be satisfied by a realistic problem class in SSPM. We also show that the proposed method always discovers patterns that are equally useful or more useful than those found by Tarone-Bonferroni and Subfamily-wise Multiple Testing (SMT). Finally, we conducted several experiments with both synthetic and real-world data to evaluate the performance of our method. In the experiments with real-world datasets, the proposed method discovered many more useful patterns than the existing method in all five tasks conducted.
ISSN: 1041-4347, 1558-2191
DOI: 10.1109/TKDE.2022.3208626