Missing value imputation using a novel grey based fuzzy c-means, mutual information based feature selection, and regression model

Bibliographic details
Published in: Expert systems with applications 2019-01, Vol.115, p.68-94
Main authors: Sefidian, Amir Masoud; Daneshpour, Negin
Format: Article
Language: English
Description
Summary:
• A new hybrid method for the imputation of missing values is proposed.
• The method is based on a novel fuzzy c-means, mutual information, and regression.
• Imputation performance improves when the Grey relational grade is used in the fuzzy c-means algorithm.
• The proposed method outperforms five existing imputation methods in most cases.
• The proposed method can also provide high classification accuracies.

Missing values in real-world data are both prevalent and inevitable, so they should be handled carefully before any mining or learning process. This paper proposes a novel technique to impute missing data. It employs a new version of the Fuzzy c-Means clustering algorithm that benefits from the advantages of the Grey Relational Grade over Minkowski-like similarity measures. To impute a missing value more accurately, it also performs a local, mutual information based feature selection in each cluster so that only highly relevant features are used. Briefly, missing values are imputed in the following steps. First, the algorithm determines the importance of each missing attribute. Next, the input instances are separated into several fuzzy clusters. Then, the algorithm selects the clusters that satisfy a minimum condition. After that, it chooses the highly dependent features of the instances within each selected cluster using a mutual information based feature selection approach. Once the features are selected, regression models are applied to the selected features of the selected clusters to provide estimates for a missing value. Finally, the missing value is imputed as a weighted average of the estimates obtained in the previous step. Three well-known evaluation criteria and the accuracy of a classification task are used to assess the performance of the proposed method. The experimental results on seven UCI data sets with different missing ratios and strategies indicate that the proposed algorithm generally outperforms five other imputation methods.
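
The summary names two building blocks that are easy to illustrate in isolation: a Grey Relational Grade used as a similarity measure, and a weighted average that combines several estimates into one imputed value. The sketch below is not the authors' implementation. It assumes the standard textbook (Deng) form of the grey relational coefficient with a distinguishing coefficient `rho = 0.5`, and the function names `grey_relational_grades` and `weighted_average`, the toy data, and the weights are all hypothetical. The record does not say how the paper defines the weights or modifies fuzzy c-means.

```python
def grey_relational_grades(reference, candidates, rho=0.5):
    """Grey relational grade of each candidate row with respect to `reference`.

    Assumption: standard Deng formulation; the paper's exact variant is not
    given in this record.
    """
    # Absolute per-feature differences between the reference and each candidate.
    deltas = [[abs(r - c) for r, c in zip(reference, row)] for row in candidates]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0.0:
        # Every candidate is identical to the reference.
        return [1.0 for _ in candidates]
    grades = []
    for row in deltas:
        # Grey relational coefficient per feature, then averaged into a grade.
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        grades.append(sum(coeffs) / len(coeffs))
    return grades


def weighted_average(estimates, weights):
    """Combine several estimates of one missing value (weights are hypothetical)."""
    return sum(e * w for e, w in zip(estimates, weights)) / sum(weights)


# Toy usage: similarity of two instances to a reference instance, then a
# combination of two made-up per-cluster regression estimates.
grades = grey_relational_grades([0.0, 0.0], [[0.0, 0.0], [1.0, 1.0]])
imputed = weighted_average([2.0, 4.0], [1.0, 3.0])
```

A grade close to 1 means the candidate closely follows the reference across all features. Unlike a Minkowski-style distance, the grade is normalised by the smallest and largest differences observed in the compared set.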
ISSN: 0957-4174, 1873-6793
DOI: 10.1016/j.eswa.2018.07.057