Undersampled K-means approach for handling imbalanced distributed data

Bibliographic details
Published in:	Progress in artificial intelligence 2014, Vol.3 (1), p.29-38
Main authors:	Kumar, N. Santhosh; Rao, K. Nageswara; Govardhan, A.; Reddy, K. Sudheer; Mahmood, Ali Mirza
Format:	Article
Language:	English
Subjects:	Artificial Intelligence; Computational Intelligence; Computer Imaging; Computer Science; Control; Data Mining and Knowledge Discovery; Mechatronics; Natural Language Processing (NLP); Pattern Recognition and Graphics; Regular Paper; Robotics; Vision
Online access:	Full text

Description
Abstract:	K-means is a partitional clustering technique that is well known and widely used for its low computational cost. However, the performance of the K-means algorithm tends to be affected by skewed data distributions, i.e., imbalanced data. It often produces clusters of relatively uniform size even when the input data have clusters of varied size, which is called the “uniform effect”. In this paper, we analyze the causes of this effect and illustrate that it probably occurs more in the K-means clustering process. As the minority class decreases in size, the “uniform effect” becomes evident. To prevent the “uniform effect”, we revisit the well-known K-means algorithm and provide a general method to properly cluster imbalanced distributed data. The proposed algorithm consists of a novel undersampling technique implemented by intelligently removing noisy and weak instances from the majority class. We conduct experiments on twelve UCI datasets from various application domains, using five algorithms for comparison and eight evaluation metrics. Experimental results show the effectiveness of the proposed clustering algorithm in clustering balanced and imbalanced data.
ISSN:	2192-6352 2192-6360
DOI:	10.1007/s13748-014-0045-6
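
The “uniform effect” named in the abstract can be reproduced on a toy example. The sketch below is not the paper's method and does not implement its undersampling step. It only runs plain Lloyd-style K-means on made-up one-dimensional data, and every name and number in it (`kmeans_1d`, the 1000:20 class sizes, the value ranges) is an assumption chosen for illustration.

```python
def kmeans_1d(points, centers, max_iter=100):
    """Plain Lloyd's K-means on 1-D data; returns (centers, labels)."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: abs(x - centers[k]))
            for x in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for k in range(len(centers)):
            members = [x for x, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return centers, labels


# Hypothetical imbalanced data: a wide majority class and a small minority class.
majority = [i / 100 for i in range(1000)]        # 1000 points in [0.00, 9.99]
minority = [12.0 + j / 100 for j in range(20)]   # 20 points in [12.00, 12.19]
points = majority + minority

# Start from the true class means, the most favourable initialisation.
init = [sum(majority) / len(majority), sum(minority) / len(minority)]
centers, labels = kmeans_1d(points, init)

sizes = [labels.count(0), labels.count(1)]
# Majority points that ended up in the cluster containing the minority class.
absorbed = sum(1 for lab in labels[:len(majority)] if lab == 1)
print("cluster sizes:", sizes)
print("majority points absorbed by the minority cluster:", absorbed)
```

Even when started from the true class means, the second cluster absorbs a large share of the majority class, so the two cluster sizes end up far closer to each other than the true 1000:20 split. Reducing the majority class before clustering, as the abstract proposes, targets exactly this imbalance; the paper's own criteria for which instances count as noisy or weak are not given in this record.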