K-NNDP: K-means algorithm based on nearest neighbor density peak optimization and outlier removal

Bibliographic Details
Published in: Knowledge-Based Systems, 2024-06, Vol. 294, p. 111742, Article 111742
Main Authors: Liao, Jiyong; Wu, Xingjiao; Wu, Yaxin; Shu, Juelin
Format: Article
Language: English
Online Access: Full Text
Abstract: K-means is an unsupervised vector-quantization method derived from signal processing and is now widely used in data mining and knowledge discovery. Its advantages include simple operation, scalability, and suitability for processing large-scale datasets. However, K-means selects its initial cluster centers randomly, which causes unstable clustering results, and outliers degrade its performance. To address these challenges, we propose a nearest-neighbor density peak (NNDP) algorithm for optimizing the initial cluster centers and removing outliers. To solve the problem of randomly selected initial cluster centers, we propose NNDP-based K-means (K-NNDP), which automatically selects the initial cluster centers based on decision values, ensuring stable algorithm operation. In addition, we adopt a local search strategy to eliminate outliers: outliers are identified using a set threshold, and the median is used instead of the mean in subsequent centroid iterations to reduce their impact on the algorithm. Notably, most previous studies have addressed these two problems independently, which makes it easy for the algorithm to fall into a locally optimal solution; we therefore combine them within a single K-nearest-neighbor model. To evaluate the effectiveness of K-NNDP, we conducted comparative experiments on several synthetic and real-world datasets, where K-NNDP outperformed two classical algorithms and six state-of-the-art improved K-means algorithms. The results show that K-NNDP effectively solves the randomness and outlier-sensitivity problems of K-means, and the improvement is significant.
Highlights:
• K-NNDP addresses the combined problem of selecting initial centroids and detecting outliers.
• Initial cluster centers are selected automatically using nearest-neighbor density peaks.
• The median replaces the mean to avoid the influence of outliers on clustering results.
• The algorithm outperforms state-of-the-art improved K-means algorithms.
ISSN: 0950-7051, 1872-7409
DOI: 10.1016/j.knosys.2024.111742
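The abstract above describes the method only at a high level. As a rough illustration of the general idea it names — density-peak-style selection of initial centers from decision values, a threshold-based outlier flag, and median (rather than mean) centroid updates — here is a minimal Python sketch. The specific choices below (the k-NN density estimate, the gamma = rho * delta decision value, the mean + 2·std outlier cutoff, and all parameter names) are assumptions made for illustration, not the paper's exact definitions.

```python
# Illustrative sketch only: the density estimate, decision value, and outlier
# threshold below are assumptions, not the published K-NNDP definitions.
import numpy as np

def knn_distances(X, k):
    """Pairwise distances and each point's k nearest-neighbor distances (self excluded)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    knn = np.sort(D, axis=1)[:, 1:k + 1]          # drop the zero self-distance
    return D, knn

def initial_centers_by_density_peaks(X, n_clusters, k=10):
    """Pick initial centers as the points with the largest decision values gamma = rho * delta."""
    D, knn = knn_distances(X, k)
    rho = 1.0 / (knn.mean(axis=1) + 1e-12)        # denser points have smaller k-NN distances
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]        # distance to nearest point of higher density
        delta[i] = D[i].max() if higher.size == 0 else D[i, higher].min()
    gamma = rho * delta
    return X[np.argsort(-gamma)[:n_clusters]].copy()

def flag_outliers(X, k=10, n_std=2.0):
    """Mark points whose mean k-NN distance exceeds mean + n_std * std as outliers."""
    _, knn = knn_distances(X, k)
    score = knn.mean(axis=1)
    return score > score.mean() + n_std * score.std()

def knndp_style_cluster(X, n_clusters, k=10, n_iter=100):
    """K-means-style loop that ignores flagged outliers and updates centers with the median."""
    outlier = flag_outliers(X, k)
    centers = initial_centers_by_density_peaks(X[~outlier], n_clusters, k)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        labels = np.argmin(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
        new_centers = np.array([
            np.median(X[(labels == c) & ~outlier], axis=0)
            if np.any((labels == c) & ~outlier) else centers[c]
            for c in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers, outlier

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    blobs = np.vstack([rng.normal(m, 0.3, size=(100, 2)) for m in ([0, 0], [3, 3], [0, 4])])
    data = np.vstack([blobs, rng.uniform(-3, 7, size=(5, 2))])   # a few uniform outliers
    labels, centers, outlier = knndp_style_cluster(data, n_clusters=3)
    print(centers.round(2), outlier.sum(), "points flagged as outliers")
```

Because the centers are chosen deterministically from the decision values rather than at random, repeated runs on the same data give the same initialization, which is the stability property the abstract emphasizes; the median update and the outlier mask keep isolated points from dragging the centroids.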