SMUTF: Schema Matching Using Generative Tags and Hybrid Features
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | We introduce SMUTF, a unique approach for large-scale tabular data schema
matching (SM), which assumes that supervised learning does not affect
performance in open-domain tasks, thereby enabling effective cross-domain
matching. This system uniquely combines rule-based feature engineering,
pre-trained language models, and generative large language models. In an
innovative adaptation inspired by the Humanitarian Exchange Language, we deploy
'generative tags' for each data column, enhancing the effectiveness of SM.
SMUTF exhibits extensive versatility, working seamlessly with any pre-existing
pre-trained embeddings, classification methods, and generative models.
Recognizing the lack of extensive, publicly available datasets for SM, we
have created and open-sourced the HDXSM dataset from the public humanitarian
data. We believe this to be the most exhaustive SM dataset currently available.
In evaluations across various public datasets and the novel HDXSM dataset,
SMUTF demonstrated exceptional performance, surpassing existing
state-of-the-art models in terms of accuracy and efficiency, and improving the
F1 score by 11.84% and the AUC of ROC by 5.08%. |
---|---|
DOI: | 10.48550/arxiv.2402.01685 |