A scalable algorithm for mining maximal frequent sequences using a sample

Bibliographic details
Published in:	Knowledge and Information Systems, 2008-05, Vol. 15 (2), p. 149-179
Main authors:	Luo, Congnan; Chung, Soon M.
Format:	Article
Language:	English
Subjects:	Algorithms; Computer Science; Data mining; Data Mining and Knowledge Discovery; Database Management; Information Storage and Retrieval; Information Systems and Communication Service; Information Systems Applications (incl. Internet); IT in Business; Regular Paper; Sampling techniques; Studies
Online access:	Full text

Description
Summary:	In this paper, we propose an efficient, scalable algorithm for mining Maximal Sequential Patterns using Sampling (MSPS). The MSPS algorithm prunes much more of the search space than other algorithms because it applies both subsequence infrequency-based pruning and supersequence frequency-based pruning. In MSPS, a sampling technique is used to identify long frequent sequences early, instead of enumerating all their subsequences. We propose a way to adjust the user-specified minimum support level for mining a sample of the database to achieve better overall performance. This adjustment makes sampling more efficient when the minimum support is small. A signature-based method and a hash-based method are developed for the subsequence infrequency-based pruning when the seed set of frequent sequences for candidate generation is too big to be loaded into memory. A prefix tree structure is developed to count candidate sequences of different sizes during database scanning, and it also facilitates customer sequence trimming. Our experiments showed that MSPS has very good performance and better scalability than other algorithms.
ISSN:	0219-1377, 0219-3116
DOI:	10.1007/s10115-006-0056-0
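
The two pruning rules named in the summary can be illustrated with a small sketch. This is not the MSPS implementation: it is a minimal reading of the terms, assuming that a sequence is an ordered list of itemsets, that a candidate contained in a known frequent sequence is itself frequent (supersequence frequency-based pruning), and that a candidate containing a known infrequent sequence is itself infrequent (subsequence infrequency-based pruning). The names `seq`, `is_subsequence`, `support`, and `classify_without_counting` are hypothetical and chosen only for this illustration.

```python
from typing import FrozenSet, Iterable, List, Optional, Tuple

Itemset = FrozenSet[str]
Seq = Tuple[Itemset, ...]


def seq(*itemsets: Iterable[str]) -> Seq:
    """Build a sequence from itemsets, e.g. seq("a", "bc") is <(a)(b c)>."""
    return tuple(frozenset(s) for s in itemsets)


def is_subsequence(small: Seq, big: Seq) -> bool:
    """True if each itemset of `small` is contained, in order, in a
    distinct itemset of `big` (greedy left-to-right matching)."""
    pos = 0
    for itemset in small:
        while pos < len(big) and not itemset <= big[pos]:
            pos += 1
        if pos == len(big):
            return False
        pos += 1  # the next itemset must match strictly later
    return True


def support(candidate: Seq, database: List[Seq]) -> float:
    """Fraction of customer sequences that contain the candidate."""
    hits = sum(is_subsequence(candidate, customer) for customer in database)
    return hits / len(database)


def classify_without_counting(
    candidate: Seq,
    known_frequent: List[Seq],
    known_infrequent: List[Seq],
) -> Optional[str]:
    """Decide a candidate's status without scanning the database, if possible.

    Assumed rules (illustrative, not taken from the paper's text):
    - contained in a known frequent sequence   -> "frequent"
    - contains a known infrequent sequence     -> "infrequent"
    - otherwise                                -> None (must be counted)
    """
    if any(is_subsequence(candidate, f) for f in known_frequent):
        return "frequent"
    if any(is_subsequence(s, candidate) for s in known_infrequent):
        return "infrequent"
    return None


# A toy database of four customer sequences.
database = [
    seq("a", "bc", "d"),
    seq("a", "b", "d"),
    seq("ab", "d"),
    seq("c"),
]
known_frequent = [seq("a", "b", "d")]   # e.g. a long sequence found via a sample
known_infrequent = [seq("c", "d")]
```

In this toy setting, `classify_without_counting(seq("a", "d"), known_frequent, known_infrequent)` resolves the candidate from the known long frequent sequence alone, which is the kind of saving the summary attributes to finding long frequent sequences early. Only candidates for which the function returns `None` would need a `support` count against the database.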