On clustering categories of categorical predictors in generalized linear models

Detailed description

Bibliographic details
Published in:	Expert systems with applications 2021-11, Vol.182, p.115245, Article 115245
Main authors:	Carrizosa, Emilio, Galvis Restrepo, Marcela, Romero Morales, Dolores
Format:	Article
Language:	English
Keywords:	Categories; Clustering; Complexity; Generalized linear models; Greedy randomized adaptive search procedure; Interpretability; Numerical methods; Proximity between categories; Statistical learning; Statistical models
Online access:	Full text

Description
Summary:
• The paper proposes a method to cluster categorical features in Generalized Linear Models.
• The proposed approach uses a numerical method guided by the learning performance.
• The underlying structure of the categories and their relationship is identified using proximity graphs.
• Complexity is reduced and accuracy is competitive with the benchmark one-hot encoding of categorical features.

We propose a method to reduce the complexity of Generalized Linear Models in the presence of categorical predictors. The traditional one-hot encoding, where each category is represented by a dummy variable, can be wasteful, difficult to interpret, and prone to overfitting, especially when dealing with high-cardinality categorical predictors. This paper addresses these challenges by finding a reduced representation of the categorical predictors by clustering their categories. This is done through a numerical method that aims to preserve (or even improve) accuracy while reducing the number of coefficients to be estimated for the categorical predictors. Thanks to the design of the method, we are able to derive a proximity measure between categories of a categorical predictor that can be easily visualized. We illustrate the performance of our approach on real-world classification and count-data datasets, where we see that clustering the categorical predictors reduces complexity substantially without harming accuracy.
ISSN:	0957-4174 1873-6793
DOI:	10.1016/j.eswa.2021.115245
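
The reduction in coefficients described in the summary can be illustrated with a minimal sketch. It only contrasts a one-hot design matrix with one built from clustered categories; it does not reproduce the paper's numerical method for finding the clusters. The data, the helper names `one_hot` and `clustered_encoding`, and the hand-picked mapping `category_to_cluster` are all assumptions made for illustration.

```python
import numpy as np


def one_hot(values, levels):
    """Return an (n, len(levels)) 0/1 matrix with one column per level."""
    index = {level: j for j, level in enumerate(levels)}
    X = np.zeros((len(values), len(levels)), dtype=int)
    for i, v in enumerate(values):
        X[i, index[v]] = 1
    return X


def clustered_encoding(values, category_to_cluster):
    """Map each category to its cluster, then one-hot encode the clusters."""
    clusters = sorted(set(category_to_cluster.values()))
    mapped = [category_to_cluster[v] for v in values]
    return one_hot(mapped, clusters)


# Hypothetical predictor with five categories.
values = ["a", "b", "c", "d", "e", "a", "c"]
levels = sorted(set(values))

# Hypothetical clustering, fixed by hand here; the paper instead searches
# for it with a numerical method guided by learning performance.
category_to_cluster = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1}

X_full = one_hot(values, levels)                              # one column per category
X_reduced = clustered_encoding(values, category_to_cluster)   # one column per cluster

print(X_full.shape[1], "columns before clustering")
print(X_reduced.shape[1], "columns after clustering")
```

Either matrix could then be passed to a GLM fitting routine; the clustered version asks the model to estimate one coefficient per cluster rather than one per category.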