Selecting Robust Features for Machine Learning Applications using Multidata Causal Discovery
Format: Article
Language: English
Abstract: Robust feature selection is vital for creating reliable and interpretable
Machine Learning (ML) models. When designing statistical prediction models in
cases where domain knowledge is limited and underlying interactions are
unknown, choosing the optimal set of features is often difficult. To mitigate
this issue, we introduce a Multidata (M) causal feature selection approach that
simultaneously processes an ensemble of time series datasets and produces a
single set of causal drivers. This approach uses the causal discovery
algorithms PC1 or PCMCI that are implemented in the Tigramite Python package.
These algorithms utilize conditional independence tests to infer parts of the
causal graph. Our causal feature selection approach filters out
causally spurious links before passing the remaining causal features as inputs
to ML models (multiple linear regression, Random Forest) that predict the
targets. We apply our framework to the statistical intensity prediction of
Western Pacific Tropical Cyclones (TC), for which it is often difficult to
accurately choose drivers and their dimensionality reduction (time lags,
vertical levels, and area-averaging). Using more stringent significance
thresholds in the conditional independence tests helps eliminate spurious
causal relationships, thus helping the ML model generalize better to unseen TC
cases. M-PC1 with a reduced number of features outperforms M-PCMCI, non-causal
ML, and other feature selection methods (lagged correlation, random), even
slightly outperforming feature selection based on eXplainable Artificial
Intelligence. The optimal causal drivers obtained from our causal feature
selection help improve our understanding of underlying relationships and
suggest new potential drivers of TC intensification.
DOI: 10.48550/arxiv.2304.05294
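To make the pipeline described in the abstract concrete, below is a minimal sketch of a PC1-style causal feature filter. This is not the paper's implementation (which uses the Tigramite package's PC1/PCMCI algorithms); it is a simplified stand-in using lagged partial correlations with a Fisher z-test as the conditional independence test. The function name `causal_feature_filter`, the greedy conditioning scheme, and the synthetic data are all illustrative assumptions.

```python
import math
import numpy as np

def _pval(r, n, k):
    """Two-sided p-value for a (partial) correlation r from n samples with
    k conditioning variables, via the Fisher z-transform."""
    r = max(min(r, 0.999999), -0.999999)
    z = abs(math.atanh(r)) * math.sqrt(max(n - k - 3, 1))
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

def partial_corr(x, y, Z):
    """Correlation of x and y after regressing out the columns of Z."""
    if Z.shape[1] > 0:
        Z1 = np.column_stack([Z, np.ones(len(Z))])
        x = x - Z1 @ np.linalg.lstsq(Z1, x, rcond=None)[0]
        y = y - Z1 @ np.linalg.lstsq(Z1, y, rcond=None)[0]
    return float(np.corrcoef(x, y)[0, 1])

def causal_feature_filter(X, y, tau_max=3, alpha=0.01):
    """Greedy, simplified PC1-style filter: candidates are (feature, lag)
    pairs; a candidate is kept only if its partial correlation with the
    target, given the already-selected drivers, is significant at level
    alpha. A stricter alpha removes more spurious links, mirroring the
    stringent significance thresholds discussed in the abstract."""
    T, p = X.shape
    n = T - tau_max
    target = y[tau_max:]
    # Build all lagged candidate series, aligned with the target.
    cands = {(j, tau): X[tau_max - tau: T - tau, j]
             for j in range(p) for tau in range(1, tau_max + 1)}
    # Test candidates in order of decreasing marginal correlation.
    order = sorted(cands, key=lambda c: -abs(np.corrcoef(cands[c], target)[0, 1]))
    selected = []
    for c in order:
        Z = (np.column_stack([cands[s] for s in selected])
             if selected else np.empty((n, 0)))
        r = partial_corr(cands[c], target, Z)
        if _pval(r, n, Z.shape[1]) < alpha:
            selected.append(c)
    return selected
```

The returned (feature, lag) pairs would then serve as inputs to the downstream prediction models mentioned in the abstract, e.g. a multiple linear regression or Random Forest regressor.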