On clustering categories of categorical predictors in generalized linear models
Published in: | Expert systems with applications 2021-11, Vol.182, p.115245, Article 115245 |
---|---|
Main authors: | , , |
Format: | Article |
Language: | English |
Summary: | • The paper proposes a method to cluster categorical features in Generalized Linear Models. • The proposed approach uses a numerical method guided by the learning performance. • The underlying structure of the categories and their relationship is identified using proximity graphs. • Complexity is reduced and accuracy results are competitive against the benchmark one-hot encoding of categorical features.
We propose a method to reduce the complexity of Generalized Linear Models in the presence of categorical predictors. The traditional one-hot encoding, where each category is represented by a dummy variable, can be wasteful, difficult to interpret, and prone to overfitting, especially when dealing with high-cardinality categorical predictors. This paper addresses these challenges by clustering the categories of each categorical predictor to obtain a reduced representation. This is done through a numerical method which aims to preserve (or even improve) accuracy while reducing the number of coefficients to be estimated for the categorical predictors. Thanks to the design of the method, we are able to derive a proximity measure between the categories of a categorical predictor that can be easily visualized. We illustrate the performance of our approach on real-world classification and count-data datasets, where we see that clustering the categorical predictors reduces complexity substantially without harming accuracy. |
---|---|
ISSN: | 0957-4174, 1873-6793 |
DOI: | 10.1016/j.eswa.2021.115245 |
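
To make the contrast in the summary concrete, the following minimal sketch compares the number of coefficients needed by a one-hot encoding with the number needed once categories are merged into clusters. It is an illustration only, not the authors' numerical method: the predictor `regions`, the sample `observations`, the hand-picked grouping `cluster_of`, and the helper `one_hot` are all hypothetical names and data introduced here.

```python
def one_hot(values, levels):
    """Encode each value as a 0/1 vector with one entry per level."""
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for v in values:
        row = [0] * len(levels)
        row[index[v]] = 1
        rows.append(row)
    return rows


# Hypothetical categorical predictor with six categories (made-up data).
regions = ["north", "south", "east", "west", "centre", "coast"]
observations = ["north", "coast", "east", "south", "west", "centre", "north"]

# Hypothetical grouping of the six categories into two clusters.
# The paper finds the grouping numerically; here it is fixed by hand.
cluster_of = {
    "north": "A", "east": "A", "centre": "A",
    "south": "B", "west": "B", "coast": "B",
}
clusters = sorted(set(cluster_of.values()))

# Design-matrix columns for the predictor under each encoding.
full = one_hot(observations, regions)
reduced = one_hot([cluster_of[v] for v in observations], clusters)

n_full = len(full[0])        # one dummy column per category
n_reduced = len(reduced[0])  # one dummy column per cluster

print(f"one-hot columns: {n_full}, clustered columns: {n_reduced}")
```

In a fitted model, each dummy column corresponds to one coefficient (minus a reference level if an intercept is included), so merging categories shrinks the part of the coefficient vector devoted to this predictor. Whether accuracy is preserved depends on how the clusters are chosen, which is the question the paper addresses.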