BANDAR: Benchmarking Snippet Generation Algorithms for (RDF) Dataset Search

Detailed description

Bibliographic details
Published in:	IEEE Transactions on Knowledge and Data Engineering, 2023-02, Vol. 35 (2), pp. 1227-1241
Main authors:	Wang, Xiaxia; Cheng, Gong; Pan, Jeff Z.; Kharlamov, Evgeny; Qu, Yuzhong
Format:	Article
Language:	English
Keywords:	Algorithms; benchmark; Benchmark testing; Benchmarks; dataset search; Datasets; Europe; Evaluation; Measurement; Metadata; Open data; Quality assessment; RDF data; Resource description framework; Search process; Snippet generation; Urban areas

Description
Abstract:	The large volume of open data on the Web is expected to be reused and create value. Finding the right data to reuse is a non-trivial task addressed by recent dataset search systems, which retrieve datasets relevant to a keyword query. An important component of such systems is snippet generation, extracting data from a retrieved dataset to exemplify its content and explain its relevance to the query. Snippet generation algorithms have emerged but were mainly evaluated by user studies. More efficient and reproducible evaluation methods are needed. To meet this challenge, in this article, we present a set of quality metrics for assessing the usefulness of a snippet from different perspectives, and we select and aggregate them into quality profiles for different stages of a dataset search process. Furthermore, we create a benchmark from thousands of collected real-world data needs and datasets, on which we apply the presented quality metrics and profiles to evaluate snippets generated by two existing algorithms and three adapted algorithms. The results, which are reproducible as they are automatically computed without human interaction, show the pros and cons of the tested algorithms and highlight directions for future research. The benchmark data is publicly available.
ISSN:	1041-4347, 1558-2191
DOI:	10.1109/TKDE.2021.3095309