Variable Selection for Clustering with Gaussian Mixture Models

Bibliographic Details
Published in:	Biometrics 2009-09, Vol.65 (3), p.701-709
Main authors:	Maugis, Cathy; Celeux, Gilles; Martin-Magniette, Marie-Laure
Format:	Article
Language:	English
Subjects:	Bayes' factor; Bayesian analysis; BIC; Biometric Methodology; Biometrics; Biometry - methods; Biostatistics; Clinical Trials as Topic; Cluster Analysis; Computer Simulation; Criteria; Data Interpretation, Statistical; Datasets; Effect Modifier, Epidemiologic; Genes; Identifiability; linear models; Linear regression; Model-based clustering; Modeling; Models, Statistical; Normal Distribution; Parametric models; Proportional Hazards Models; Regression Analysis; Statistical relevance model; Transcriptomes; Variable selection
Online access:	Full text

Description
Abstract:	This article is concerned with variable selection for cluster analysis. The problem is regarded as a model selection problem in the model-based cluster analysis context. A model generalizing the model of Raftery and Dean (2006, Journal of the American Statistical Association 101, 168-178) is proposed to specify the role of each variable. This model does not need any prior assumptions about the linear link between the selected and discarded variables. Models are compared using the Bayesian information criterion (BIC). The role of each variable is determined by an algorithm embedding two backward stepwise algorithms, one for variable selection for clustering and one for variable selection in linear regression. The model identifiability is established and the consistency of the resulting criterion is proved under regularity conditions. Numerical experiments on simulated datasets and a genomic application illustrate the usefulness of the procedure.
ISSN:	0006-341X 1541-0420
DOI:	10.1111/j.1541-0420.2008.01160.x
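
Illustration (not part of the catalog record or the article): the abstract states that candidate models are compared using BIC. The sketch below is a deliberately simplified, one-variable example of that idea and is not the authors' algorithm. It assumes the convention BIC = 2 * log-likelihood - (number of free parameters) * log(n), so that larger is better, and it compares a single Gaussian with a two-component Gaussian mixture fitted by a basic EM loop. All function names are hypothetical.

```python
import math
import random


def gaussian_logpdf(x, mean, var):
    """Log-density of a univariate normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def loglik_single(xs):
    """Maximized log-likelihood of one Gaussian (ML mean and variance)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return sum(gaussian_logpdf(x, mean, var) for x in xs)


def loglik_two_component(xs, n_iter=200):
    """Log-likelihood of a two-component 1-D mixture after a fixed number of EM steps."""
    n = len(xs)
    mean_all = sum(xs) / n
    var_all = sum((x - mean_all) ** 2 for x in xs) / n
    # Crude initialization: one component at each extreme of the data.
    means = [min(xs), max(xs)]
    variances = [var_all, var_all]
    weights = [0.5, 0.5]

    def component_logs(x):
        return [math.log(weights[k]) + gaussian_logpdf(x, means[k], variances[k])
                for k in range(2)]

    for _ in range(n_iter):
        # E-step: responsibilities, computed stably in log space.
        resp = []
        for x in xs:
            logs = component_logs(x)
            top = max(logs)
            unnorm = [math.exp(v - top) for v in logs]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: update weights, means and variances (with small floors).
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var_k, 1e-6)

    loglik = 0.0
    for x in xs:
        logs = component_logs(x)
        top = max(logs)
        loglik += top + math.log(sum(math.exp(v - top) for v in logs))
    return loglik


def bic(loglik, n_params, n):
    """BIC under the 'larger is better' convention assumed in this sketch."""
    return 2.0 * loglik - n_params * math.log(n)


def compare_bic(xs):
    """Return (BIC of one Gaussian, BIC of two-component mixture)."""
    n = len(xs)
    bic_one = bic(loglik_single(xs), 2, n)          # mean, variance
    bic_two = bic(loglik_two_component(xs), 5, n)   # 1 weight, 2 means, 2 variances
    return bic_one, bic_two


if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated variable with two well-separated groups.
    sample = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(6, 1) for _ in range(100)]
    print(compare_bic(sample))
```

Under these assumptions, a variable whose two-component BIC exceeds its one-component BIC shows evidence of cluster structure. The procedure described in the abstract goes further: it works on sets of variables and also models how the discarded variables depend on the selected ones through linear regression.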