Missing Data Infill with Automunge
Main author: | |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Summary: | Missing data is a fundamental obstacle in the practice of data science. This
paper surveys several imputation conventions available in Automunge, an
open source Python library for tabular data preprocessing, including
"ML infill", in which auto ML models are trained for target features from
partitioned extracts of a training set. A series of validation experiments
was performed to benchmark imputation scenarios against downstream model
performance. For the given benchmark sets, ML infill in many cases
outperformed the other scenarios for both numeric and categoric target
features, and was otherwise at minimum within the noise distributions of the
other imputation scenarios. Evidence also suggested that supplementing ML
infill with support columns of boolean integer markers signaling the presence
of infill was usually beneficial to downstream model performance. We consider
these results sufficient to recommend defaulting to ML infill for tabular
learning, and further recommend supplementing imputations with support columns
signaling the presence of infill, each of which can be prepared with
push-button operation in the Automunge library. Our contributions include an
auto ML derived missing data imputation library for tabular learning in the
Python ecosystem, fully integrated into a preprocessing platform with an
extensive library of feature transformations, with a novel production-friendly
implementation that bases imputation models on a designated train set, giving
a consistent basis for additional data. |
DOI: | 10.48550/arxiv.2202.09484 |
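
The summary describes three ideas that a small example may make more concrete: a model is trained on the rows of a training set where the target feature is present, its predictions fill the missing entries, and a support column of 0/1 markers records which entries were infilled. The following is a minimal sketch under stated assumptions, not the Automunge implementation: it uses only the Python standard library, a one-predictor least-squares line stands in for the auto ML model, and the function names and the `_infilled` column suffix are hypothetical.

```python
# Illustrative sketch of the "ML infill" idea; not the Automunge API.
# Assumptions: one numeric predictor, and an ordinary least-squares line
# standing in for the auto ML model described in the summary.

def fit_infill_model(rows, feature, target):
    """Fit target ~ feature using only training rows where target is present."""
    known = [(r[feature], r[target]) for r in rows if r[target] is not None]
    n = len(known)
    mean_x = sum(x for x, _ in known) / n
    mean_y = sum(y for _, y in known) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in known)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in known)
    slope = cov_xy / var_x if var_x else 0.0
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def apply_infill(rows, feature, target, model):
    """Fill missing targets with predictions and add a 0/1 marker column."""
    filled = []
    for r in rows:
        missing = r[target] is None
        if missing:
            value = model["intercept"] + model["slope"] * r[feature]
        else:
            value = r[target]
        # The "_infilled" suffix is a hypothetical name for the support column.
        filled.append({**r, target: value, target + "_infilled": int(missing)})
    return filled


train = [
    {"x": 1.0, "y": 2.0},
    {"x": 2.0, "y": 4.0},
    {"x": 3.0, "y": None},  # missing entry in the training set
    {"x": 4.0, "y": 8.0},
]
test = [{"x": 5.0, "y": None}, {"x": 6.0, "y": 12.0}]

# The model is fit once on the designated train set ...
model = fit_infill_model(train, "x", "y")
train_filled = apply_infill(train, "x", "y", model)
# ... and the same fitted model is reused on additional data.
test_filled = apply_infill(test, "x", "y", model)
```

Fitting once on the train set and reusing the fitted model for later data mirrors the consistent-basis point made in the summary: imputations for new data do not depend on the contents of that new data set.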