Set similarity query algorithm based on length partition

Bibliographic details
Main authors: LI XIAOWU, HU JUNTAO, LEI YAN, JIA LIANYIN, DING JIAMAN, SHEN BINGLIN, YOU JINGUO, ZUO YUHAO
Format: Patent
Languages: Chinese; English
Description
Summary: The invention relates to a set similarity query method based on length partitioning and belongs to the field of data mining and information retrieval. The method comprises the following steps: sorting and numbering the sets (records) in a data set; constructing an inverted index structure and a length mapping table for the sorted data set; and, for a given query q and similarity threshold t, retrieving all records whose similarity with q is greater than or equal to t, using the inverted index structure and the length mapping table. The method combines the idea of length partitioning with the classic similarity query algorithm ScanCount. Data preprocessing, length partitioning and an efficient index structure allow records that cannot satisfy the similarity threshold to be filtered out quickly, which improves the efficiency of the algorithm. In addition, a simpler counting array is designed, which reduces the space overhead. Therefore, the method provided by the inven
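
The steps named in the summary (sort and number the records, build an inverted index and a length mapping table, then count overlaps ScanCount-style within the admissible length range) can be illustrated with a minimal Python sketch. This is not the patented implementation. The record does not state which similarity measure is used, so Jaccard similarity and its length bounds t*|q| <= |r| <= |q|/t are assumptions here, as are all names (`build_index`, `query`, `length_map`, `EPS`) and the choice to size the counting array to the candidate id range only.

```python
import math
from bisect import bisect_left
from collections import defaultdict

EPS = 1e-9  # tolerance for floating-point rounding in the bounds (assumption)


def build_index(records):
    """Sort records by length, number them, and build both index structures."""
    # Record ids are positions in the length-sorted list (hypothetical scheme).
    sorted_recs = sorted((frozenset(r) for r in records), key=len)
    inverted = defaultdict(list)  # token -> ascending list of record ids
    length_map = {}               # length -> (first id, one past last id)
    for rid, rec in enumerate(sorted_recs):
        start, _ = length_map.get(len(rec), (rid, rid))
        length_map[len(rec)] = (start, rid + 1)
        for tok in rec:
            inverted[tok].append(rid)
    return sorted_recs, inverted, length_map


def query(index, q, t):
    """Return records whose Jaccard similarity with q is >= t (0 < t <= 1)."""
    sorted_recs, inverted, length_map = index
    q = frozenset(q)
    if not q:
        return []
    # Length filter for Jaccard (assumed measure): t*|q| <= |r| <= |q|/t.
    lmin = math.ceil(t * len(q) - EPS)
    lmax = math.floor(len(q) / t + EPS)
    ranges = [rng for n, rng in length_map.items() if lmin <= n <= lmax]
    if not ranges:
        return []
    lo = min(start for start, _ in ranges)
    hi = max(end for _, end in ranges)
    # Counting array covers only the candidate id range [lo, hi).
    counts = [0] * (hi - lo)
    for tok in q:
        postings = inverted.get(tok, [])
        i = bisect_left(postings, lo)  # skip ids of records that are too short
        while i < len(postings) and postings[i] < hi:
            counts[postings[i] - lo] += 1
            i += 1
    results = []
    for offset, overlap in enumerate(counts):
        if overlap == 0:
            continue
        rec = sorted_recs[lo + offset]
        union = len(q) + len(rec) - overlap
        if overlap >= t * union - EPS:  # Jaccard = overlap / union >= t
            results.append(rec)
    return results


records = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "b", "c", "d", "e", "f", "g"},
    {"x", "y"},
    {"a", "b", "c", "d"},
]
index = build_index(records)
matches = query(index, {"a", "b", "c"}, 0.6)
```

In this sketch the seven-element record is never touched during counting, because its length falls outside the admissible range for the query; that is the filtering effect the summary attributes to length partitioning. Because ids follow the length order, each posting list can be entered at the first admissible id with a binary search.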