Balancing effort and benefit of K-means clustering algorithms in Big Data realms

Bibliographic details
Published in:	PloS one 2018-09, Vol.13 (9), p.e0201874-e0201874
Main authors:	Pérez-Ortega, Joaquín; Almanza-Ortega, Nelva Nely; Romero, David
Format:	Article
Language:	English
Subjects:	Algorithms; Artificial intelligence; Big data; Cluster Analysis; Clustering; Computer and Information Sciences; Computer Simulation; Computing time; Criteria; Data Interpretation, Statistical; Data management; Data mining; Data Mining - methods; Data processing; Engineering and Technology; Experimentation; Heuristic; International conferences; Iterative methods; Methods; Models, Statistical; Normal distribution; Numerical Analysis, Computer-Assisted; Pattern recognition; Performance enhancement; Physical Sciences; Publishing; Reproducibility of Results; Research and Analysis Methods; Social Sciences; Vector quantization
Online access:	Full text

Description
Abstract:	In this paper we propose a criterion to balance the processing time and the solution quality of k-means clustering algorithms when applied to instances where the number n of objects is large. Most known strategies for improving the performance of k-means algorithms are related to the initialization or classification steps. In contrast, our criterion applies to the convergence step: the process stops whenever the number of objects that change their assigned cluster at any iteration is lower than a given threshold. Through computer experimentation with synthetic and real instances, we found that a threshold close to 0.03n involves a decrease in computing time of about a factor 4/100, yielding solutions whose quality reduces by less than two percent. These findings naturally suggest the usefulness of our criterion in Big Data realms.
ISSN:	1932-6203
DOI:	10.1371/journal.pone.0201874
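
The stopping rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name kmeans_early_stop, its parameters, the random initialization, and the choice to update centroids before testing the criterion are all assumptions made for illustration. The sketch runs a plain Lloyd-style k-means and stops as soon as the number of objects that changed cluster in an iteration falls below threshold_fraction * n (0.03 mirrors the value reported in the abstract).

```python
import random


def kmeans_early_stop(points, k, threshold_fraction=0.03, max_iter=100, seed=0):
    """Lloyd-style k-means with an early-stopping convergence test.

    Hypothetical helper, not the paper's code. Stops when fewer than
    threshold_fraction * n objects change cluster in one iteration.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Assumption: centroids are initialized from k randomly chosen objects.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [-1] * n
    # Assumption: a floor of 1 makes threshold_fraction=0 behave like the
    # classical criterion (stop when no object changes cluster).
    threshold = max(1.0, threshold_fraction * n)
    iterations = 0
    for iterations in range(1, max_iter + 1):
        changed = 0
        # Classification step: assign each object to its nearest centroid.
        for i, p in enumerate(points):
            best = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            if best != labels[i]:
                labels[i] = best
                changed += 1
        # Update step: recompute each centroid as the mean of its members.
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for p, lab in zip(points, labels):
            counts[lab] += 1
            for d in range(dim):
                sums[lab][d] += p[d]
        for j in range(k):
            if counts[j]:  # an empty cluster keeps its previous centroid
                centroids[j] = [s / counts[j] for s in sums[j]]
        # Convergence step: stop once few enough objects changed cluster.
        if changed < threshold:
            break
    return centroids, labels, iterations


if __name__ == "__main__":
    data_rng = random.Random(1)
    data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(500)]
    data += [(data_rng.gauss(6, 1), data_rng.gauss(6, 1)) for _ in range(500)]
    _, _, fast = kmeans_early_stop(data, 2, threshold_fraction=0.03)
    _, _, strict = kmeans_early_stop(data, 2, threshold_fraction=0.0)
    print("iterations with 0.03n threshold:", fast)
    print("iterations with classical criterion:", strict)
```

With the same seed both calls follow the same sequence of assignments, so the run with the 0.03n threshold can never take more iterations than the classical run. How much time it saves, and how much solution quality it costs, depends on the data and is what the paper measures experimentally.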