A multiple association-based unsupervised feature selection algorithm for mixed data sets

Bibliographic details
Published in: Expert systems with applications 2023-02, Vol. 212, p. 118718, Article 118718
Main authors: Taha, Ayman; Hadi, Ali S.; Cosgrave, Bernard; McKeever, Susan
Format: Article
Language: English
Online access: Full text
Description
Abstract: Companies have increasing access to very large datasets within their domain. Analysing these datasets often requires feature selection techniques to reduce the dimensionality of the data and to prioritize features for downstream knowledge generation tasks. Effective feature selection is a key part of clustering, regression and classification. It offers several ways to improve the machine learning pipeline: eliminating redundant and irrelevant features, reducing model over-fitting, shortening model training times and producing more explainable models. However, despite the widespread availability and use of categorical data in practice, feature selection for categorical and/or mixed data has received relatively little attention in comparison to numerical data. Furthermore, existing feature selection methods for mixed data scale poorly with the number of objects, because their time complexity is nonlinear in that number. In this work, we propose a generic multiple association measure for mixed datasets and a novel feature selection algorithm that uses multiple association across features. Our algorithm is based upon the belief that the most representative set of features should be as diverse and as minimally dependent on each other as possible. The proposed algorithm formulates feature selection as an optimization problem, searching for the set of features that have minimum association amongst them. We present a generic multiple association measure and two associated feature selection algorithms: the Naive and Greedy Feature Selection Algorithms, called NFSA and GFSA, respectively. Our proposed GFSA algorithm is evaluated on 15 benchmark datasets and compared to four existing state-of-the-art feature selection techniques. We demonstrate that our approach provides comparable downstream classification performance, outperforming other leading techniques on several datasets. Both time complexity analysis and experimental results show that our proposed algorithm significantly reduces the processing time required for unsupervised feature selection, especially for long datasets, which have a huge number of objects, whilst also yielding comparable clustering and classification performance. On the other hand, we do not recommend our approach for wide datasets, where the number of features is huge with respect to the number of objects, e.g., image, text and genome datasets.
•New generic multiple associat
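The abstract frames feature selection as a search for a feature subset with minimum association amongst its members. The following is a minimal Python sketch of that general idea only; it is not the authors' GFSA or NFSA, whose details are not given in this record. It assumes a precomputed symmetric matrix of pairwise association scores (the abstract's multiple association measure is not reproduced here), and the function name `greedy_min_association`, the seeding rule and the sum-based selection criterion are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def greedy_min_association(assoc: Sequence[Sequence[float]], k: int) -> List[int]:
    """Greedily pick k feature indices with low association amongst them.

    `assoc` is assumed to be a symmetric matrix where assoc[i][j] is a
    non-negative association score between features i and j. This is an
    illustrative heuristic, not the algorithm from the article.
    """
    n = len(assoc)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of features")

    # Seed with the feature whose total association with all others is lowest.
    first = min(
        range(n),
        key=lambda j: sum(assoc[j][i] for i in range(n) if i != j),
    )
    selected = [first]
    remaining = set(range(n)) - {first}

    while len(selected) < k:
        # Add the candidate with the smallest total association to the
        # features chosen so far; sorted() makes tie-breaking deterministic.
        best = min(
            sorted(remaining),
            key=lambda j: sum(assoc[j][i] for i in selected),
        )
        selected.append(best)
        remaining.remove(best)

    return selected


if __name__ == "__main__":
    # Toy scores: features 0 and 1 are strongly associated, as are 2 and 3.
    toy = [
        [1.0, 0.9, 0.1, 0.2],
        [0.9, 1.0, 0.2, 0.3],
        [0.1, 0.2, 1.0, 0.8],
        [0.2, 0.3, 0.8, 1.0],
    ]
    print(greedy_min_association(toy, 2))
```

On the toy matrix, the sketch picks one feature from each strongly associated pair, which matches the stated goal of a diverse, minimally dependent subset. For mixed data, the pairwise scores would have to come from a measure that handles both numerical and categorical features; that choice is left open here.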
ISSN: 0957-4174, 1873-6793
DOI: 10.1016/j.eswa.2022.118718