Clustering data with non-ignorable missingness using semi-parametric mixture models assuming independence within components

Bibliographic details
Published in:	Advances in data analysis and classification 2023-12, Vol.17 (4), p.1081-1122
Main authors:	du Roy de Chaumaray, Marie, Marbac, Matthieu
Format:	Article
Language:	English
Subjects:	Algorithms; Chemistry and Earth Sciences; Clustering; Computer Science; Data Mining and Knowledge Discovery; Density; Economics; Estimation; Finance; Health Sciences; Humanities; Insurance; Law; Management; Mathematics and Statistics; Medicine; Optimization; Physics; Probabilistic models; Regular Article; Statistical Theory and Methods; Statistics; Statistics for Business; Statistics for Engineering; Statistics for Life Sciences; Statistics for Social Sciences
Online access:	Full text

Description
Summary:	We propose a semi-parametric clustering model assuming conditional independence given the component. One advantage is that this model can handle non-ignorable missingness. The model defines each component as a product of univariate probability distributions but makes no assumption on the form of each univariate density. Note that the mixture model is used for clustering but not for estimating the density of the full variables (observed and unobserved). Estimation is performed by maximizing an extension of the smoothed likelihood allowing missingness. This optimization is achieved by a Majorization-Minorization algorithm. We illustrate the relevance of our approach by numerical experiments conducted on simulated data. Under mild assumptions, we show the identifiability of the model defining the distribution of the observed data and the monotonicity of the algorithm. We also propose an extension of this new method to the case of mixed-type data that we illustrate on a real data set. The proposed method is implemented in the R package MNARclust available on CRAN.
ISSN:	1862-5347 1862-5355
DOI:	10.1007/s11634-023-00534-w
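
Note:	The component structure described in the summary (each component a product of univariate densities, with no assumed form for those densities) corresponds to the generic conditional-independence mixture sketched below. The notation is an assumption made for illustration and is not taken from the article: K components, proportions \pi_k, d variables, and unspecified univariate densities f_{kj}. How the missingness indicators enter the model is not stated in this record, so it is left out of the sketch.

```latex
% Generic conditional-independence mixture (illustrative notation, assumed).
% \pi_k  : proportion of component k, with \pi_k > 0 and \sum_k \pi_k = 1
% f_{kj} : univariate density of variable j in component k, form unspecified
\[
  f(x) \;=\; \sum_{k=1}^{K} \pi_k \prod_{j=1}^{d} f_{kj}(x_j),
  \qquad x = (x_1, \dots, x_d).
\]
```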