Beyond k-Means++: Towards better cluster exploration with geometrical information

Bibliographic details
Published in: Pattern Recognition, 2024-02, Vol. 146, p. 110036, Article 110036
Main authors: Ping, Yuan; Li, Huina; Hao, Bin; Guo, Chun; Wang, Baocang
Format: Article
Language: English
Description
Abstract: Although k-means and its variants are known for their remarkable efficiency, they depend strongly on prior knowledge of K and on the assumption of a circle-like pattern. As a result, the algorithms may merely divide the input space instead of discovering non-predetermined data patterns. We therefore propose beyond k-means++, which infers and utilizes explicit clusters by emphasizing local geometrical information for better cluster exploration. To avoid the dependence on K, a novel framework of iterative division and aggregation (IDA) over k-means++ is presented. It begins with any K≥1, then increases K as clusters are divided and reduces K as clusters are aggregated. To break through the circle-like pattern limitation, we introduce a reasonability checking strategy (RCS) for cluster division. Given local geometrical information, RCS supports arbitrary cluster shapes by rejecting edge patterns with a distinguished convergence direction and merging adjacent clusters with pseudo-edge patterns. Furthermore, we design an edge shrinkage strategy (ESS). Taking edge patterns as the cluster prototype, it improves accuracy by avoiding the loss of representability caused by irregular distributions. To compensate for the loss of efficiency, a near maximin and random sampling (NMMRS) algorithm is suggested for large-scale, high-dimensional data. Experimental results confirm that beyond k-means++ handles arbitrary cluster shapes with remarkable accuracy.
Highlights:
• A novel framework of iterative division and aggregation (IDA) over k-means++ works with any K.
• A reasonability checking strategy (RCS) makes beyond k-means++ support arbitrary cluster shapes.
• An edge shrinkage strategy (ESS) allows edge patterns to move slightly towards their centroids.
• Beyond k-means++ integrates NMMRS, k-means++, RCS, and ESS into IDA.
• Results show the superiority of beyond k-means++ in discovering irregular data distributions.
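For orientation, the k-means++ seeding step that the proposed framework builds on can be sketched as below. This is only the standard D² sampling procedure, written with the Python standard library. It is not the paper's IDA, RCS, ESS, or NMMRS components, whose details are not given in this record. The names `sq_dist` and `kmeanspp_seeds` are illustrative assumptions, as is the fixed default random seed.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points, k, rng=None):
    """Pick k initial centers by D^2 sampling (standard k-means++ seeding).

    Illustrative sketch only; not the algorithm proposed in the article.
    """
    rng = rng or random.Random(0)  # assumed default seed for reproducibility
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    # First center: chosen uniformly at random.
    centers = [rng.choice(points)]
    # d2[i] holds the squared distance from points[i] to its nearest center.
    d2 = [sq_dist(p, centers[0]) for p in points]

    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            nxt = rng.choice(points)
        else:
            # Sample the next center with probability proportional to d2.
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        d2 = [min(d, sq_dist(p, nxt)) for d, p in zip(d2, points)]
    return centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
    print(kmeanspp_seeds(data, 2))
```

Note that K is a required input here, which is exactly the dependence the abstract says the IDA framework is designed to remove.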
ISSN: 0031-3203, 1873-5142
DOI: 10.1016/j.patcog.2023.110036