A Sampling-Based Density Peaks Clustering Algorithm for Large-Scale Data

Bibliographic details
Published in:	Pattern Recognition, 2023-04, Vol. 136, p. 109238, Article 109238
Main authors:	Ding, Shifei; Li, Chao; Xu, Xiao; Ding, Ling; Zhang, Jian; Guo, Lili; Shi, Tianhao
Format:	Article
Language:	English
Subjects:	Density peaks clustering; Large-scale data; Sampling method; TI search strategy
Online access:	Full text

Description
Summary:	• An improved triangle-inequality-based (TI) search strategy is proposed.
• An approximate local density calculation for representatives is proposed.
• Experiments show that our algorithm takes far less time than DPC and other recently proposed state-of-the-art algorithms.
With the rapid development of information technology, massive amounts of data are being generated. How to discover useful information to support decision-making has become a focus of research. Clustering is considered one of the main means of dealing with large-scale data. Density peaks clustering (DPC) is an effective density-based clustering algorithm that is widely applied in numerous fields because of its satisfactory performance. However, the computational complexity of DPC is O(N^2), which makes it poorly suited to large-scale data. To solve this issue, a sampling-based density peaks clustering algorithm for large-scale data (SDPC) is proposed. First, a sampling method is used to reduce the number of distance calculations. Second, approximate representatives are identified by an improved TI search strategy, which further accelerates the clustering process. The approximate representatives are then clustered by DPC. Finally, each remaining point is allocated to the same cluster as its nearest representative. Experimental results on both synthetic and real-world datasets show that SDPC is more efficient than DPC, while its clustering performance remains at the same level as that of DPC.
ISSN:	0031-3203; 1873-5142
DOI:	10.1016/j.patcog.2022.109238
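
The steps described in the abstract above (sample, pick representatives, cluster them with DPC, assign every remaining point to its nearest representative) can be sketched roughly as follows. This is a minimal illustration and not the authors' SDPC implementation: the record does not specify the sampling method, the improved TI search strategy, or the approximate density calculation, so uniform random sampling and a brute-force nearest-representative search stand in for them here. The function names `dpc` and `sampled_dpc` and the parameters `dc`, `n_samples`, and `seed` are assumptions made for this sketch.

```python
import numpy as np


def dpc(points, n_clusters, dc):
    """Plain density peaks clustering; quadratic in len(points)."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

    # Local density: number of other points closer than the cutoff dc.
    rho = (dist < dc).sum(axis=1) - 1

    # Rank points from densest to sparsest (stable sort breaks ties by index).
    order = np.argsort(-rho, kind="stable")

    # delta: distance to the nearest higher-ranked point; parent: that point.
    delta = np.empty(n)
    parent = np.full(n, -1)
    delta[order[0]] = dist[order[0]].max()
    for rank in range(1, n):
        i = order[rank]
        denser = order[:rank]
        j = denser[np.argmin(dist[i, denser])]
        delta[i] = dist[i, j]
        parent[i] = j

    # Centers are the points with the largest rho * delta.
    gamma = rho * delta
    gamma[order[0]] = np.inf  # the densest point is always a center
    centers = np.argsort(-gamma, kind="stable")[:n_clusters]

    # Every other point inherits the label of its parent, densest first.
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels


def sampled_dpc(points, n_clusters, dc, n_samples, seed=0):
    """Cluster a random sample with DPC, then label all points by nearest sample."""
    rng = np.random.default_rng(seed)

    # Assumption: uniform sampling without replacement picks the representatives.
    rep_idx = rng.choice(len(points), size=n_samples, replace=False)
    reps = points[rep_idx]

    # DPC now costs O(n_samples^2) instead of O(N^2).
    rep_labels = dpc(reps, n_clusters, dc)

    # Assumption: brute-force nearest-representative search (the paper
    # accelerates this kind of search with triangle-inequality pruning).
    d = np.linalg.norm(points[:, None, :] - reps[None, :, :], axis=-1)
    return rep_labels[np.argmin(d, axis=1)]
```

In this sketch the quadratic distance matrix is built only for the `n_samples` representatives, and the assignment step costs one distance per point-representative pair, which is where the saving over plain DPC comes from when `n_samples` is much smaller than the dataset size.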