JEDI: These aren't the JSON documents you're looking for... (Extended Version)
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | The JavaScript Object Notation (JSON) is a popular data format used in
document stores to natively support semi-structured data. In this paper, we
address the problem of JSON similarity lookup queries: given a query document
and a distance threshold $\tau$, retrieve all JSON documents that are within
distance $\tau$ of the query document. Due to its recursive definition, JSON data are
naturally represented as trees. Unlike other hierarchical formats such
as XML, JSON supports both ordered and unordered sibling collections within a
single document. This feature poses a new challenge to the tree model and
distance computation. We propose JSON tree, a lossless tree representation of
JSON documents, and define the JSON Edit Distance (JEDI), the first edit-based
distance measure for JSON documents. We develop an algorithm, called QuickJEDI,
for computing JEDI by leveraging a new technique to prune expensive sibling
matchings. It outperforms a baseline algorithm by an order of magnitude in
runtime. To boost the performance of JSON similarity queries, we introduce an
index called JSIM and a highly effective upper bound based on tree sorting. Our
algorithm for the upper bound runs in $O(n \tau)$ time and $O(n + \tau \log n)$
space, which substantially improves the previous best bound of $O(n^2)$ time
and $O(n \log n)$ space (where $n$ is the tree size). Our experimental
evaluation shows that our solution scales to databases with millions of
documents and JSON trees with tens of thousands of nodes. |
---|---|
DOI: | 10.48550/arxiv.2201.08099 |