Inferring feature importance with uncertainties with application to large genotype data

Bibliographic Details
Published in: PLoS computational biology 2023-03, Vol. 19 (3), p. e1010963
Main authors: Johnsen, Pål Vegard; Strümke, Inga; Langaas, Mette; DeWan, Andrew Thomas; Riemer-Sørensen, Signe
Format: Article
Language: English
Online access: Full text
Description
Abstract: Estimating feature importance, i.e. the contribution of a feature to one or more predictions, is an essential aspect of explaining data-based models. Besides explaining the model itself, an equally relevant question is which features are important in the underlying data-generating process. We present a Shapley-value-based framework for inferring the importance of individual features, including uncertainty in the estimator. We build upon the recently published model-agnostic feature importance score SAGE (Shapley additive global importance) and introduce Sub-SAGE. For tree-based models, it has the advantage that it can be estimated without computationally expensive resampling. We argue that for all model types the uncertainties in our Sub-SAGE estimator can be estimated using bootstrapping, and demonstrate the approach for tree ensemble methods. The framework is exemplified on synthetic data as well as large genotype data for predicting feature importance with respect to obesity.
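The bootstrap idea described in the abstract can be sketched in a few lines. The snippet below is a generic illustration, not the paper's Sub-SAGE estimator: it bootstraps a permutation-importance score for a random forest and reports percentile intervals. The synthetic data, model choice, and importance score are placeholder assumptions made for the example.

```python
# Hedged sketch: bootstrap uncertainty for a tree-ensemble feature
# importance score (NOT the paper's Sub-SAGE estimator; permutation
# importance stands in for a Shapley-based score here).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=6,
                           n_informative=3, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

n_boot = 30
scores = np.empty((n_boot, X.shape[1]))
for b in range(n_boot):
    # resample rows with replacement, then re-score feature importance
    idx = rng.integers(0, len(X), len(X))
    r = permutation_importance(model, X[idx], y[idx],
                               n_repeats=5, random_state=b)
    scores[b] = r.importances_mean

mean = scores.mean(axis=0)
lo, hi = np.percentile(scores, [2.5, 97.5], axis=0)  # 95% bootstrap interval
for j in range(X.shape[1]):
    print(f"feature {j}: {mean[j]:.3f} [{lo[j]:.3f}, {hi[j]:.3f}]")
```

In the paper, the uncertainty is attached to the Sub-SAGE estimator itself; the pattern above only conveys the general resampling logic of reporting an interval rather than a point estimate per feature.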
ISSN: 1553-7358
1553-734X
DOI: 10.1371/journal.pcbi.1010963