Improving Clustering Accuracy of K-Means and Random Swap by an Evolutionary Technique Based on Careful Seeding

K-Means is a “de facto” standard clustering algorithm due to its simplicity and efficiency. K-Means, though, strongly depends on the initialization of the centroids (seeding method) and often gets stuck in a local sub-optimal solution. K-Means, in fact, mainly acts as a local refiner of the centroid...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	Algorithms 2023-12, Vol.16 (12), p.572
Hauptverfasser:	Nigro, Libero, Cicirelli, Franco
Format:	Artikel
Sprache:	eng
Schlagworte:	Accuracy Algorithms Artificial intelligence Centroids Clustering Computer services industry Datasets evolutionary techniques Greedy-K-Means K-Means measures of clustering quality Methods Optimization Random Swap seeding methods
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	K-Means is a “de facto” standard clustering algorithm due to its simplicity and efficiency. K-Means, though, strongly depends on the initialization of the centroids (seeding method) and often gets stuck in a local sub-optimal solution. K-Means, in fact, mainly acts as a local refiner of the centroids, and it is unable to move centroids all over the data space. Random Swap was defined to go beyond K-Means, and its modus operandi integrates K-Means in a global strategy of centroids management, which can often generate a clustering solution close to the global optimum. This paper proposes an approach which extends both K-Means and Random Swap and improves the clustering accuracy through an evolutionary technique and careful seeding. Two new algorithms are proposed: the Population-Based K-Means (PB-KM) and the Population-Based Random Swap (PB-RS). Both algorithms consist of two steps: first, a population of J candidate solutions is built, and then the candidate centroids are repeatedly recombined toward a final accurate solution. The paper motivates the design of PB-KM and PB-RS, outlines their current implementation in Java based on parallel streams, and demonstrates the achievable clustering accuracy using both synthetic and real-world datasets.
ISSN:	1999-4893 1999-4893
DOI:	10.3390/a16120572