Parallel Spectral Clustering in Distributed Systems

Detailed description

Bibliographic details
Published in:	IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011-03, Vol. 33 (3), pp. 568-586
Main authors:	Chen, Wen-Yen; Song, Yangqiu; Bai, Hongjie; Lin, Chih-Jen; Chang, Edward Y.
Format:	Article
Language:	English
Subjects:	Algorithms; Applied sciences; Approximation; Artificial Intelligence; Cluster Analysis; Clustering; Clustering algorithms; Computation; Computer Communication Networks - instrumentation; Computer science; Computer science, control theory, systems; Computer Simulation; Computer systems and distributed systems. User interface; Concurrent computing; Data processing. List processing. Character string processing; Distributed computing; Exact sciences and technology; Information retrieval. Graph; Laplace equations; Memory organisation. Data processing; Models, Statistical; Nearest neighbor searches; nearest neighbors; normalized cuts; Nyström approximation; Parallel algorithms; Parallel processing; Parallel spectral clustering; Pattern Recognition, Automated - methods; Reproducibility of Results; Scalability; Similarity; Software; Sparse matrices; Spectra; Strategy; Systems Integration; Theoretical computing; USA Councils

Description
Abstract:	Spectral clustering algorithms have been shown to be more effective in finding clusters than some traditional algorithms, such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computational time when the size of a data set is large. To perform clustering on large data sets, we investigate two representative ways of approximating the dense similarity matrix. We compare one approach by sparsifying the matrix with another by the Nyström method. We then pick the strategy of sparsifying the matrix via retaining nearest neighbors and investigate its parallelization. We parallelize both memory use and computation on distributed computers. Through an empirical study on a document data set of 193,844 instances and a photo data set of 2,121,863, we show that our parallel algorithm can effectively handle large problems.
ISSN:	0162-8828, 1939-3539, 2160-9292
DOI:	10.1109/TPAMI.2010.88
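
Illustration:	The abstract describes approximating the dense similarity matrix by keeping only each instance's nearest neighbors. The following is a minimal single-machine sketch of that idea, not the authors' distributed implementation. The Gaussian similarity, the bandwidth `sigma`, the rule of keeping an entry when either point is a neighbor of the other, and the names `gaussian_similarity` and `sparse_knn_similarity` are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def gaussian_similarity(x, y, sigma=1.0):
    """Gaussian similarity between two points (assumed choice; sigma is hypothetical)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def sparse_knn_similarity(points, t, sigma=1.0):
    """Return a symmetric sparse similarity matrix as {i: {j: s_ij}}.

    Entry (i, j) is kept if j is among the t nearest neighbors of i
    or i is among the t nearest neighbors of j (assumed symmetrization).
    """
    n = len(points)
    sparse = defaultdict(dict)
    for i in range(n):
        # Brute-force similarities from point i to every other point.
        row = [(gaussian_similarity(points[i], points[j], sigma), j)
               for j in range(n) if j != i]
        # Keep only the t most similar neighbors of point i.
        row.sort(reverse=True)
        for s, j in row[:t]:
            sparse[i][j] = s
            sparse[j][i] = s  # mirror the entry so the matrix stays symmetric
    return {i: dict(sparse[i]) for i in range(n)}


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
S = sparse_knn_similarity(points, t=2)

# Count stored entries and compare with the size of the dense matrix.
nonzeros = sum(len(row) for row in S.values())
print(f"stored entries: {nonzeros} of {len(points) ** 2}")
```

The sketch stores only the retained entries, which is the source of the memory saving the abstract refers to. Each row is still computed by brute force here, so the sketch reduces storage but not the cost of building the matrix.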