A fast density-based data stream clustering algorithm with cluster centers self-determined for mixed data

Bibliographic Details
Published in: Information Sciences, 2016-06, Vol. 345, p. 271-293
Main authors: Chen, Jin-Yin; He, Hui-Hao
Format: Article
Language: eng
Online access: Full text
Description
Summary: Most data streams encountered in real life consist of data objects with mixed numerical and categorical attributes. Most current data stream clustering algorithms suffer from low clustering quality, difficulty in determining cluster centers, and poor handling of outliers. This paper proposes a fast density-based data stream clustering algorithm in which cluster centers are determined automatically during the initialization stage. Based on an analysis of the relationships between data attributes, mixed data sets are classified into three types, and a corresponding distance metric is designed for each type. A field intensity-distance distribution graph is built over the data objects, and a linear regression model with residual analysis is used to find the outliers of the graph, which allows cluster centers to be determined automatically. Once the cluster centers are found, all data objects are clustered according to their distance to the centers. The data stream clustering algorithm adopts an online/offline two-stage processing framework and a new micro-cluster characteristic vector to maintain the arriving data objects dynamically. A micro-cluster decay function and a micro-cluster deletion mechanism are used to maintain the micro clusters so that they accurately reflect the evolution of the data stream. Finally, the performance of the proposed algorithm is evaluated in a series of experiments on real-world mixed data sets, in comparison with several well-known clustering algorithms, in terms of clustering purity, efficiency, and time complexity.
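The center-selection step described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation, and everything below is an assumption made for illustration: the "field intensity" is taken to be a Gaussian-kernel density, the "distance" is the distance to the nearest object of higher intensity, objects are purely numerical and compared with Euclidean distance (the paper's three mixed-attribute metrics are not reproduced), and an object counts as a center when its regression residual exceeds a hypothetical threshold of two residual standard deviations. All function and parameter names are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance; stands in for the paper's mixed-data metrics (assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def intensity_and_distance(points, sigma=1.0):
    """Return (rho, delta): an assumed Gaussian 'field intensity' per object and
    the distance from each object to its nearest object of higher intensity."""
    n = len(points)
    dist = [[euclidean(p, q) for q in points] for p in points]
    rho = [
        sum(math.exp(-((dist[i][j] / sigma) ** 2)) for j in range(n) if j != i)
        for i in range(n)
    ]
    delta = []
    for i in range(n):
        denser = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # The most intense object has no denser neighbour; use its largest distance.
        delta.append(min(denser) if denser else max(dist[i]))
    return rho, delta


def residual_outliers(x, y, threshold=2.0):
    """Fit y = a + b*x by least squares and return the indices whose positive
    residual exceeds `threshold` residual standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    sd = math.sqrt(sum(r * r for r in residuals) / n)
    return [i for i, r in enumerate(residuals) if sd > 0 and r > threshold * sd]


def cluster(points, sigma=1.0, threshold=2.0):
    """Pick centers as outliers of the intensity-distance graph, then assign
    every object to its nearest center. Labels are -1 if no center is found."""
    rho, delta = intensity_and_distance(points, sigma)
    centers = residual_outliers(rho, delta, threshold)
    if not centers:
        return centers, [-1] * len(points)
    labels = [
        min(range(len(centers)), key=lambda k: euclidean(p, points[centers[k]]))
        for p in points
    ]
    return centers, labels
```

Calling `cluster(points)` on a list of numeric tuples returns the indices of the objects chosen as centers and one label per object. The online/offline micro-cluster maintenance mentioned in the summary is outside the scope of this sketch.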
ISSN:0020-0255
1872-6291
DOI:10.1016/j.ins.2016.01.071