Elastic net-based high dimensional data selection for regression

Bibliographic details
Published in:	Expert systems with applications 2024-06, Vol.244, p.122958, Article 122958
Main authors:	Chamlal, Hasna; Benzmane, Asmaa; Ouaderhman, Tayeb
Format:	Article
Language:	English
Subjects:	Elastic net; Feature screening; High-dimensional data; Rank correlation; Regression
Online access:	Full text

Description
Summary:	High-dimensional feature selection is of particular interest to researchers. In some domains, such as microarray data, it is quite common for a group of highly correlated explanatory variables to be of equal importance for inclusion in the predictive model. This paper proposes a new hybrid feature selection approach, K-EN, that integrates feature screening based on Kendall’s tau with Elastic Net regularized regression. Because K-EN embeds the Elastic Net, it inherits the grouping effect, which automatically includes all the highly correlated variables in a group. The K-EN approach offers insightful solutions to high-dimensional regression problems and improves Elastic Net performance, since the screening phase is preceded by a step that further reduces the number of explanatory variables by removing those that disagree with the target based on Kendall’s tau. The use of Kendall’s tau further enhances Elastic Net performance, as it handles heavy-tailed distributions, non-parametric models, outliers, and non-normal data robustly. K-EN is therefore a time-saving approach. The proposed algorithm is evaluated on four simulation scenarios and four publicly available datasets (riboflavin, eyedata, Longley, and Boston Housing), on which it achieves Mean Squared Errors (MSE) of 0.2528, 0.0098, 0.1007, and 0.4121, respectively. These MSEs are lower than those achieved by the state-of-the-art approaches reviewed in the paper. In addition, K-EN selects up to 100% of the relevant features when run on simulated data.
ISSN:	0957-4174 1873-6793
DOI:	10.1016/j.eswa.2023.122958
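
The two-stage idea described in the summary (screening by Kendall’s tau, then an Elastic Net fit on the surviving variables) can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not give the screening rule, so the cutoff `threshold` on |tau|, the penalty settings `lam` and `alpha`, the function names, and the synthetic data are all assumptions made for illustration. The sketch uses only the Python standard library, with a plain coordinate-descent solver standing in for a production Elastic Net.

```python
import random

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / all pairs; ties count as 0."""
    n, s = len(x), 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

def screen(X, y, threshold):
    """Keep column indices whose |tau| with y reaches the cutoff (assumed rule)."""
    return [j for j in range(len(X[0]))
            if abs(kendall_tau([row[j] for row in X], y)) >= threshold]

def soft_threshold(z, g):
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def elastic_net(X, y, lam=0.1, alpha=0.5, n_iter=200):
    """Coordinate descent for
    (1/2n)||y - Xb||^2 + lam * (alpha * ||b||_1 + (1 - alpha)/2 * ||b||^2),
    fitted on centered columns and centered y (no intercept returned)."""
    n, p = len(X), (len(X[0]) if X else 0)
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]          # residual for beta = 0
    cols = []
    for j in range(p):
        col = [row[j] for row in X]
        m = sum(col) / n
        cols.append([c - m for c in col])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            # correlation of feature j with the partial residual (feature j added back)
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            denom = sum(c * c for c in col) / n + lam * (1 - alpha)
            new = soft_threshold(rho, lam * alpha) / denom
            resid = [r - c * (new - beta[j]) for c, r in zip(col, resid)]
            beta[j] = new
    return beta

def k_en(X, y, threshold=0.2, lam=0.1, alpha=0.5):
    """Screen by |tau|, then fit the Elastic Net on the kept columns only."""
    kept = screen(X, y, threshold)
    X_kept = [[row[j] for j in kept] for row in X]
    return dict(zip(kept, elastic_net(X_kept, y, lam, alpha)))

# Synthetic data (assumed): features 0 and 1 are highly correlated and both
# drive y; features 2..7 are pure noise.
random.seed(0)
n, p = 80, 8
X, y = [], []
for _ in range(n):
    x0 = random.gauss(0, 1)
    row = [x0, x0 + 0.1 * random.gauss(0, 1)]
    row += [random.gauss(0, 1) for _ in range(p - 2)]
    X.append(row)
    y.append(2 * row[0] + 2 * row[1] + 0.3 * random.gauss(0, 1))

selected = k_en(X, y, threshold=0.3)
print(selected)  # maps kept column index -> Elastic Net coefficient
```

On this toy data the two correlated drivers both pass the screen and both keep nonzero coefficients, which is the grouping effect the summary refers to. In practice, library routines such as `scipy.stats.kendalltau` (which computes the tie-adjusted tau-b by default) and `sklearn.linear_model.ElasticNet` (where `alpha` and `l1_ratio` play the roles of `lam` and `alpha` here) would normally replace the hand-written loops.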