New and efficient DCA based algorithms for minimum sum-of-squares clustering

The purpose of this paper is to develop new efficient approaches based on DC (Difference of Convex functions) programming and DCA (DC Algorithm) to perform clustering via minimum sum-of-squares Euclidean distance. We consider the two most widely used models for the so-called Minimum Sum-of-Squares C...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Pattern recognition 2014-01, Vol.47 (1), p.388-401
Hauptverfasser: Hoai An, Le Thi, Hoai Minh, Le, Tao, Pham Dinh
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:The purpose of this paper is to develop new efficient approaches based on DC (Difference of Convex functions) programming and DCA (DC Algorithm) to perform clustering via minimum sum-of-squares Euclidean distance. We consider the two most widely used models for the so-called Minimum Sum-of-Squares Clustering (MSSC in short) that are a bilevel programming problem and a mixed integer program. Firstly, the mixed integer formulation of MSSC is carefully studied and is reformulated as a continuous optimization problem via a new result on exact penalty technique in DC programming. DCA is then investigated to the resulting problem. Secondly, we introduce a Gaussian kernel version of the bilevel programming formulation of MSSC, named GKMSSC. The GKMSSC problem is formulated as a DC program for which a simple and efficient DCA scheme is developed. A regularization technique is investigated for exploiting the nice effect of DC decomposition and a simple procedure for finding good starting points of DCA is developed. The proposed DCA schemes are original and very inexpensive because they amount to computing, at each iteration, the projection of points onto a simplex and/or onto a ball, and/or onto a box, which are all determined in the explicit form. Numerical results on real word datasets show the efficiency, the scalability of DCA and its great superiority with respect to k-means and kernel k-means, standard methods for clustering. •The purpose of this paper is to develop new efficient approaches based on DC programming and DCA for clustering.•We consider the two models for the Minimum Sum-of-Squares Clustering (MSSC): a mixed integer program and a bilevel programming formulation.•The mixed integer program is reformulated as a DC program via an exact penalty technique for which DCA is investigated.•A Gaussian kernel version of the bilevel programming formulation is introduced and a simple and efficient DCA scheme is developed.•The two main features of DCA, say the effect of DC decomposition and of starting points, are carefully studied.
ISSN:0031-3203
1873-5142
DOI:10.1016/j.patcog.2013.07.012