Bayesian network Motifs for reasoning over heterogeneous unlinked datasets

Bibliographic details
Published in:	Data mining and knowledge discovery 2024-11, Vol.38 (6), p.3643-3689
Main authors:	Sui, Yi; Kwan, Alex; Olson, Alexander W.; Sanner, Scott; Silver, Daniel A.
Format:	Article
Language:	English
Subjects:	Artificial Intelligence; Bayesian analysis; Chemistry and Earth Sciences; Computer Science; Data collection; Data Mining and Knowledge Discovery; Datasets; Information Storage and Retrieval; Physics; Reasoning; Spatial data; Statistics for Engineering
Online access:	Full text

Description
Abstract:	Modern data-oriented applications often require integrating data from multiple heterogeneous sources. When these datasets share attributes, but are otherwise unlinked, there is no way to join them and reason at the individual level explicitly. However, as we show in this work, this does not prevent probabilistic reasoning over these heterogeneous datasets even when the data and shared attributes exhibit significant mismatches that are common in real-world data. Different datasets have different sample biases, disagree on category definitions and spatial representations, collect data at different temporal intervals, and mix aggregate-level with individual data. In this work, we demonstrate how a set of Bayesian network motifs allows all of these mismatches to be resolved in a composable framework that permits joint probabilistic reasoning over all datasets without manipulating, modifying, or imputing the original data, thus avoiding potentially harmful assumptions. We provide an open-source Python tool that encapsulates our methodology and demonstrate this tool on a number of real-world use cases.
ISSN:	1384-5810 1573-756X
DOI:	10.1007/s10618-024-01054-7