Exploring the Influence of Sampling on Pattern Support Distribution
Identifying the pattern support distribution (PSD) in datasets is useful for many data mining tasks, such as market basket analysis. The support of a pattern is the frequency of its occurrence in a dataset. Calculating the distribution of these supports over an entire dataset is computationally expe...
Gespeichert in:
Hauptverfasser: | , , |
---|---|
Format: | Tagungsbericht |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Identifying the pattern support distribution (PSD) in datasets is useful for many data mining tasks, such as market basket analysis. The support of a pattern is the frequency of its occurrence in a dataset. Calculating the distribution of these supports over an entire dataset is computationally expensive; this cost can be reduced by sampling from the dataset and computing the PSD on a relatively small sample. However, this may miscount patterns and cause significant changes in the distribution identified. Based on the fact that the PSD shows a power-law relationship, in this paper we investigate the influence of sampling on the characteristics of the power-law relationship in the pattern support distribution. We consider sampling effect on this relationship under two assumptions: uniform distribution of pattern supports, and independent identically distributed (i.i.d.) distributions. We experimentally evaluate the influence on data from four real-world transaction datasets. |
---|---|
DOI: | 10.1109/CIT.2008.Workshops.91 |