A mixture model approach to spectral clustering and application to textual data

Bibliographic details
Published in:	Statistical methods & applications 2022-12, Vol.31 (5), p.1071-1097
Main authors:	Di Nuzzo, Cinzia; Ingrassia, Salvatore
Format:	Article
Language:	English
Subjects:	Algorithms; Chemistry and Earth Sciences; Cluster analysis; Clustering; Computer Science; Datasets; Economics; Finance; Health Sciences; Humanities; Insurance; Kernel functions; Law; Management; Mathematics and Statistics; Medicine; Original Paper; Physics; Probabilistic models; Robustness (mathematics); Spectral methods; Statistical Theory and Methods; Statistics; Statistics for Business; Statistics for Engineering; Statistics for Life Sciences; Statistics for Social Sciences
Online access:	Full text

Description
Abstract:	The spectral clustering algorithm is a technique based on the properties of the pairwise similarity matrix coming from a suitable kernel function. It is a useful approach for high-dimensional data since the units are clustered in feature space with a reduced number of dimensions. In this paper, we consider a two-step model-based approach within the spectral clustering framework. Based on simulated data, first, we discuss criteria for selecting the number of clusters and analyzing the robustness of the model-based approach concerning the choice of the proximity parameters of the kernel functions. Finally, we consider applications of the spectral methods to cluster five real textual datasets and, in this framework, a new kernel function is also proposed. The approach is illustrated on the ground of a large numerical study based on both simulated and real datasets.
ISSN:	1618-2510, 1613-981X
DOI:	10.1007/s10260-022-00635-4