What is different between these datasets?
Main authors: | , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | The performance of machine learning models heavily depends on the quality of
input data, yet real-world applications often encounter various data-related
challenges. One such challenge arises when curating training data or
deploying a model in the real world: two comparable datasets in the same
domain may have different distributions. While numerous techniques exist for
detecting distribution shifts, the literature lacks comprehensive approaches
for explaining dataset differences in a human-understandable manner. To address
this gap, we propose a suite of interpretable methods (toolbox) for comparing
two datasets. We demonstrate the versatility of our approach across diverse
data modalities, including tabular data, language, images, and signals in both
low and high-dimensional settings. Our methods not only outperform comparable
and related approaches in terms of explanation quality and correctness, but
also provide actionable, complementary insights to understand and mitigate
dataset differences effectively. |
---|---|
DOI: | 10.48550/arxiv.2403.05652 |