Model free variable importance for high dimensional data
Main authors: | , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | A model-agnostic variable importance method can be used with arbitrary
prediction functions. Here we present some model-free methods that do not
require access to the prediction function. This is useful when that function is
proprietary and not available, or just extremely expensive. It is also useful
when studying residuals from a model. The cohort Shapley (CS) method is
model-free but has exponential cost in the dimension of the input space. A
supervised on-manifold Shapley method from Frye et al. (2020) is also
model-free but requires as input a second black box model that has to be trained for
the Shapley value problem. We introduce an integrated gradient (IG) version of
cohort Shapley, called IGCS, with cost $\mathcal{O}(nd)$. We show that over the
vast majority of the relevant unit cube the IGCS value function is close
to a multilinear function for which IGCS matches CS. Another benefit of IGCS is
that it allows IG methods to be used with binary predictors. We use some area
between curves (ABC) measures to quantify the performance of IGCS. On a problem
from high energy physics we verify that IGCS has nearly the same ABCs as CS
does. We also use it on a problem from computational chemistry in 1024
variables. We see there that IGCS attains much higher ABCs than we get from
Monte Carlo sampling. The code is publicly available at
https://github.com/cohortshapley/cohortintgrad |
---|---|
DOI: | 10.48550/arxiv.2211.08414 |