A novel XGBoost extension for credit scoring class-imbalanced data combining a generalized extreme value link and a modified focal loss function

Detailed description

Bibliographic details
Published in: Expert systems with applications 2022-09, Vol.202, p.117233, Article 117233
Main authors: Mushava, Jonah; Murray, Michael
Format: Article
Language: English
Description
Summary:
• We propose several XGBoost extensions to learn class-imbalanced data.
• Fit the techniques to a vast, dynamic, public Freddie Mac mortgage dataset.
• Assess model performance relative to data complexity and external validity.
• Gauge the business value of the techniques using a profitability measure.
• A proposed GEV link with a modified focal loss in XGBoost is best for rare/outlier cases.

Credit scoring datasets often exhibit a significant class imbalance, particularly in portfolios of secured loans such as mortgages. A class imbalance occurs when non-default cases far outnumber default cases. A naive classifier can achieve high accuracy by assigning all cases to the majority class; however, misclassifying the minority class is often costly. We propose using the quantile function of the generalized extreme value (GEV) distribution as a link function in XGBoost, a well-known and robust classification method, to enhance the detection of rare cases. To complement the GEV link function, the study applies a modified focal loss function in XGBoost to jointly penalize misclassification of the class of interest and focus on hard-to-classify cases. We test our proposal on a vast database of mortgage loans with rare default cases, available on the Freddie Mac website. As benchmarks, we also consider other common large credit scoring databases, existing extensions of XGBoost for handling class imbalance, and other state-of-the-art classification techniques for learning from class-imbalanced data. According to the results, the proposed model has superior predictive power over competing models when the class imbalance arises because default events are outliers or rare in the dataset. We also demonstrate that the results will likely hold up in real-world situations and add business value under certain portfolio characteristics.
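
To make the combination described in the summary more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a GEV distribution with location 0 and scale 1, so the link is the GEV quantile function and the inverse link (raw score to default probability) is the GEV CDF. The loss shown is the standard class-weighted focal loss; the paper's specific modification is not given in this record. The names `xi`, `alpha`, `gamma`, `gev_cdf`, `gev_quantile`, `focal_loss` and `grad_hess`, the default values, and the use of finite differences for the derivatives are all illustrative assumptions.

```python
import math


def gev_cdf(eta, xi=-0.25):
    """Inverse link (assumed form): GEV CDF with location 0, scale 1, shape xi."""
    if abs(xi) < 1e-12:
        z = -eta                          # Gumbel limit as xi -> 0
    else:
        t = 1.0 + xi * eta
        if t <= 0.0:                      # outside the support of the distribution
            return 0.0 if xi > 0 else 1.0
        z = -math.log(t) / xi             # log of t ** (-1 / xi)
    if z > 700.0:                         # exp(z) would overflow; CDF is effectively 0
        return 0.0
    return math.exp(-math.exp(z))


def gev_quantile(p, xi=-0.25):
    """Link function: GEV quantile, mapping a probability in (0, 1) to a raw score."""
    if abs(xi) < 1e-12:
        return -math.log(-math.log(p))
    return ((-math.log(p)) ** (-xi) - 1.0) / xi


def focal_loss(y, p, alpha=0.75, gamma=2.0, eps=1e-12):
    """Standard class-weighted focal loss (not the paper's modified version)."""
    p = min(max(p, eps), 1.0 - eps)       # keep the logarithms finite
    if y == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)


def grad_hess(y, eta, xi=-0.25, alpha=0.75, gamma=2.0, h=1e-4):
    """First and second derivative of the loss w.r.t. the raw score eta,
    approximated by central finite differences to keep the sketch short."""
    def loss(e):
        return focal_loss(y, gev_cdf(e, xi), alpha, gamma)

    l0, lp, lm = loss(eta), loss(eta + h), loss(eta - h)
    grad = (lp - lm) / (2.0 * h)
    hess = (lp - 2.0 * l0 + lm) / (h * h)
    return grad, max(hess, 1e-6)          # floor the curvature (an assumption)


if __name__ == "__main__":
    eta = gev_quantile(0.02)              # raw score matching a 2% default rate
    print(eta, gev_cdf(eta))
    print(grad_hess(1, 0.0), grad_hess(0, 0.0))
```

A gradient-boosting library needs only the per-observation gradient and curvature of the loss with respect to the raw score, which is what `grad_hess` returns. In practice one would likely vectorize this, derive the derivatives analytically, and pass the result to XGBoost through its custom-objective hook. Unlike the logistic link, the GEV CDF is asymmetric, which is the property the summary appeals to for rare default events.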
ISSN: 0957-4174, 1873-6793
DOI:10.1016/j.eswa.2022.117233