Compression in XML search engines

The structure of XML documents can be used by search engines to answer structured queries or to provide better relevancy. Several index structures exist for search in XML data. This study focuses on inverted lists with dictionary coded path types and dewey coded path instances. The dewey coded path...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
1. Verfasser: Natvig, Ola
Format: Dissertation
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:The structure of XML documents can be used by search engines to answer structured queries or to provide better relevancy. Several index structures exist for search in XML data. This study focuses on inverted lists with dictionary coded path types and dewey coded path instances. The dewey coded path index is large, but could be compressed. This study examines query processing with indexes encoded using well known integer coding methods VByte and PFor(delta) and methods tailored for the dewey index. Intersection queries and structural queries are evaluated. In addition to standard document level skipping, skip operations for path types are implemented and evaluated. Four extensions over plain PFor methods are proposed and tested. Path type sorting sorts dewey codes on their path types and store all deweys from one path type together. Column wise dewey storage stores the deweys in columns instead of rows. Prefix coding a well known method, is adapted to the column wise dewey storage, and dynamic column wise method chooses between row wise and column wise storage based on the compressed data. Experiments are performed on a XML collection based on Wikipedia. Queries are generated from the TREC 06 efficiency task query trace. Several different types of structural queries have been executed. Experiments show that column wise methods perform good on both intersection and structural queries. The dynamic column wise scheme is in most cases the best, and creates the smallest index. Special purpose skipping for path types makes some queries extremely fast and can be implemented with only limited storage footprint. The performance of in-memory search with multi-threaded query execution is limited by memory bandwidth.