Abstract Interpretation-Based Data Leakage Static Analysis
Main authors: | , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Data leakage is a well-known problem in machine learning. Data leakage occurs
when information from outside the training dataset is used to create a model.
This phenomenon renders a model excessively optimistic or even useless in the
real world since the model tends to rely heavily on the unfairly acquired
information. To date, detection of data leakages occurs post-mortem using
run-time methods. However, due to the insidious nature of data leakage, it may
not be apparent to a data scientist that a data leakage has occurred in the
first place. For this reason, it is advantageous to detect data leakages as
early as possible in the development life cycle. In this paper, we propose a
novel static analysis to detect several instances of data leakages during
development time. We define our analysis using the framework of abstract
interpretation: we define a concrete semantics that is sound and complete, from
which we derive a sound and computable abstract semantics. We implement our
static analysis inside the open-source NBLyzer static analysis framework and
demonstrate its utility by evaluating its performance and precision on over
2000 Kaggle competition notebooks. |
---|---|
DOI: | 10.48550/arxiv.2211.16073 |
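
The kind of leakage the summary describes can be illustrated with a minimal sketch. It is not taken from the paper: the data values, the helper names `fit_scaler` and `scale`, and the choice of normalization as the leaking step are all assumptions made for illustration. The sketch contrasts estimating normalization statistics on the full dataset (so the held-out sample influences the training inputs) with estimating them on the training split only.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Estimate (mean, population std) from the given values only. Hypothetical helper."""
    return mean(values), pstdev(values)


def scale(values, params):
    """Standardize values with previously estimated parameters. Hypothetical helper."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


# Made-up data: the last value plays the role of the held-out test sample.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
train, test = data[:4], data[4:]

# Leaky: the statistics are computed before the split, so they include the test sample.
leaky_params = fit_scaler(data)
leaky_train = scale(train, leaky_params)

# Clean: the statistics are computed from the training split alone.
clean_params = fit_scaler(train)
clean_train = scale(train, clean_params)

print("mean seen by the leaky pipeline:", leaky_params[0])
print("mean seen by the clean pipeline:", clean_params[0])
```

In the leaky variant the training inputs already encode information about the test sample, which is the situation that evaluation on the test split can no longer expose. A static analysis, as proposed in the summary, would aim to flag this pattern from the code alone, without running it.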