Effective and efficient retrieval of structured entities
Published in: | Proceedings of the VLDB Endowment 2020-02, Vol.13 (6), p.826-839 |
---|---|
Main authors: | , , , , , |
Format: | Article |
Language: | eng |
Online access: | Full text |
Summary: | Structured entities are commonly abstracted, such as from XML, RDF or hidden-web databases. Direct retrieval of various structured entities is highly demanded in data lakes, e.g., given a JSON object, to find the XML entities that denote the same real-world object. Existing approaches on evaluating structured entity similarity emphasize too much the structural inconsistency. Indeed, entities from heterogeneous sources could have very distinct structures, owing to various information representation conventions. We argue that the retrieval could be more tolerant to structural differences and focus more on the contents of the entities. In this paper, we first identify the unique challenge of parent-child (containment) relationships among structured entities, which unfortunately prevent the retrieval of proper entities (returning parents or children). To solve the problem, a novel hierarchy smooth function is proposed to combine the term scores in different nodes of a structured entity. Entities sharing the same structure, namely an entity family, are employed to learn the coefficient in aggregating the scores, and thus distinguish/prune the parent or child entities. Remarkably, the proposed method could cooperate with both the bag-of-words (BOW) and word embedding models, successful in retrieving unstructured documents, for querying structured entities. Extensive experiments on real datasets demonstrate that our proposal is effective and efficient. |
ISSN: | 2150-8097 |
DOI: | 10.14778/3380750.3380754 |