When stakes are high: Balancing accuracy and transparency with Model-Agnostic Interpretable Data-driven suRRogates

Bibliographic details
Published in: Expert systems with applications, 2022-09, Vol. 202, p. 117230, Article 117230
Main authors: Henckaerts, Roel; Antonio, Katrien; Côté, Marie-Pier
Format: Article
Language: English
Description
Summary: Technological advancements make it possible to develop high-performance black box predictive models. However, strictly regulated industries (like banking and insurance) ask for transparent decision-making algorithms. We therefore present a procedure to develop a Model-Agnostic Interpretable Data-driven suRRogate (maidrr) suited for structured tabular data. Knowledge is extracted from a black box via partial dependence effects. These are used to perform smart feature engineering by grouping variable values. This results in a segmentation of the feature space with automatic variable selection. A transparent generalized linear model (GLM) is fit to the features in categorical format and their relevant interactions. This GLM serves as a global surrogate to the original black box and replaces it in production. We demonstrate our R package maidrr with a case study on general insurance claim frequency modeling for six publicly available datasets. Our maidrr GLM closely approximates a gradient boosting machine (GBM) black box and outperforms both a linear and a tree surrogate as benchmarks.
• Procedure to develop an interpretable global surrogate for a complex system.
• Surrogate closely approximates a black box model regarding accuracy and fidelity.
• Automatic feature selection, segmentation and both global and local explanations.
• Satisfy transparency needs of a strictly regulated industry or high-stakes decision.
• Case study on insurance claim frequency prediction for six public datasets.
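
The first two steps named in the summary (extracting partial dependence effects from a black box, then grouping variable values into segments) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `black_box` is a hypothetical stand-in for a fitted model, the feature names `age` and `power` are invented, and the greedy threshold rule in `group_adjacent` is a simplification, not the grouping method of the paper or the API of the maidrr R package.

```python
from statistics import mean


def black_box(age, power):
    """Hypothetical stand-in for a fitted black box (e.g. a GBM) predicting claim frequency."""
    if age < 25:
        base = 0.20
    elif age < 60:
        base = 0.10
    else:
        base = 0.13
    return base * (1.0 + 0.02 * power)


def partial_dependence(model, rows, feature, grid):
    """Average the model's prediction over all rows, with `feature` forced to each grid value."""
    return {
        value: mean([model(**{**row, feature: value}) for row in rows])
        for value in grid
    }


def group_adjacent(effects, tol):
    """Greedily merge consecutive grid values whose effect lies within `tol`
    of the effect of the first value in the current segment (illustrative rule only)."""
    segments = []
    for value in sorted(effects):
        if segments and abs(effects[value] - effects[segments[-1][0]]) < tol:
            segments[-1].append(value)
        else:
            segments.append([value])
    return segments


# Tiny made-up dataset: every combination of three ages and two power values.
rows = [{"age": a, "power": p} for a in (20, 40, 70) for p in (4, 8)]

# Partial dependence of the prediction on age, evaluated on a small grid.
pd_age = partial_dependence(black_box, rows, "age", [18, 22, 30, 45, 59, 60, 75])

# Ages with a similar effect are merged into one segment.
segments = group_adjacent(pd_age, tol=0.01)
print(segments)
```

Each resulting segment would then act as one level of a categorical feature, and a GLM would be fit on those levels to serve as the surrogate. That fitting step is omitted here because the record does not describe it in enough detail to reproduce.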
ISSN: 0957-4174, 1873-6793
DOI: 10.1016/j.eswa.2022.117230