ETI: an efficient index for set similarity queries


Bibliographic details
Published in: Frontiers of Computer Science 2012-12, Vol.6 (6), p.700-712
Main authors: JIA, Lianyin, XI, Jianqing, LI, Mengjuan, LIU, Yong, MIAO, Decheng
Format: Article
Language: English
Description
Abstract: Set queries are an important topic and have attracted considerable attention. Earlier research concentrated mainly on set containment queries. This paper focuses on the T-Overlap query, which is the foundation of the set similarity query. Unlike traditional algorithms that are based on an inverted index, we design a new paradigm based on the prefix tree (trie), called the expanded trie index (ETI), which expands the trie node structure by adding new properties. Based on ETI, we convert the T-Overlap problem into finding query nodes whose query depth equals T, and propose a new algorithm called T-Similarity to solve T-Overlap efficiently. We then use a three-step framework to extend T-Overlap to other similarity predicates. Extensive experiments compare T-Similarity with inverted-index-based algorithms with respect to query cardinality, overlap threshold, dataset size, the number of distinct elements, and other factors. Results show that T-Similarity outperforms the state-of-the-art algorithms in many aspects.
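To make the T-Overlap query concrete, the sketch below answers it over a plain prefix tree: it returns every stored set that shares at least T elements with the query. This is only an illustration under stated assumptions. It is not the ETI or the T-Similarity algorithm from the article, whose added node properties and query-depth computation are not described in this record. The names `TrieNode`, `insert`, `collect`, and `t_overlap` are hypothetical.

```python
class TrieNode:
    """Node of a plain prefix tree over sorted set elements (not the paper's ETI node)."""

    def __init__(self):
        self.children = {}  # element -> TrieNode
        self.set_ids = []   # ids of the sets whose last element ends at this node


def insert(root, set_id, elements):
    """Store a set as the path of its distinct elements in sorted order."""
    node = root
    for e in sorted(set(elements)):
        node = node.children.setdefault(e, TrieNode())
    node.set_ids.append(set_id)


def collect(node, out):
    """Append every set id stored in the subtree rooted at node."""
    out.extend(node.set_ids)
    for child in node.children.values():
        collect(child, out)


def t_overlap(root, query, t):
    """Return the ids of all stored sets sharing at least t elements with query."""
    q = set(query)
    result = []

    def walk(node, matched):
        # Once t query elements lie on the path, every set below qualifies.
        if matched >= t:
            collect(node, result)
            return
        for e, child in node.children.items():
            walk(child, matched + (1 if e in q else 0))

    walk(root, 0)
    return sorted(result)


if __name__ == "__main__":
    root = TrieNode()
    insert(root, 1, ["a", "b", "c"])
    insert(root, 2, ["b", "c", "d"])
    insert(root, 3, ["a", "d"])
    insert(root, 4, ["e"])
    print(t_overlap(root, ["b", "c", "d"], 2))  # [1, 2]
    print(t_overlap(root, ["b", "c", "d"], 1))  # [1, 2, 3]
```

The traversal counts how many query elements lie on the current path and stops descending as soon as the count reaches T, which loosely mirrors the abstract's idea of locating nodes at which the overlap threshold is met. It visits far more nodes than an index designed for the task would.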
ISSN: 1673-7350, 2095-2228, 1673-7466, 2095-2236
DOI:10.1007/s11704-012-1237-5