A fast density-based data stream clustering algorithm with cluster centers self-determined for mixed data
Most data streams encountered in real life are data objects with mixed numerical and categorical attributes. Currently most data stream algorithms have shortcomings including low clustering quality, difficulties in determining cluster centers, poor ability for dealing with outliers’ issue. A fast de...
Gespeichert in:
Veröffentlicht in: | Information sciences 2016-06, Vol.345, p.271-293 |
---|---|
Hauptverfasser: | , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Most data streams encountered in real life are data objects with mixed numerical and categorical attributes. Currently most data stream algorithms have shortcomings including low clustering quality, difficulties in determining cluster centers, poor ability for dealing with outliers’ issue. A fast density-based data stream clustering algorithm with cluster centers automatically determined in the initialization stage is proposed. Based on data attribute relationships analysis, mixed data sets are filed into three types whose corresponding distance measure metrics are designed. Based on field intensity-distance distribution graph for each data object, linear regression model and residuals analysis are used to find the outliers of the graph, enabling cluster centers automatic determination. After the cluster centers are found, all data objects can be clustered according to their distance with centers. The data stream clustering algorithm adopts an online/offline two-stage processing framework, and a new micro cluster characteristic vector to maintain the arriving data objects dynamically. Micro clusters decay function and deletion mechanism of micro clusters are used to maintain the micro clusters, which reflects the data stream evolution process accurately. Finally, the performances of the proposed algorithm are testified by a series of experiments on real-world mixed data sets in comparison with several outstanding clustering algorithms in terms of the clustering purity, efficiency and time complexity. |
---|---|
ISSN: | 0020-0255 1872-6291 |
DOI: | 10.1016/j.ins.2016.01.071 |