A compressed self-index using a Ziv–Lempel dictionary

Bibliographic details
Published in: Information retrieval (Boston) 2008-08, Vol.11 (4), p.359-388
Main authors: Russo, Luís M. S.; Oliveira, Arlindo L.
Format: Article
Language: English
Description
Summary: A compressed full-text self-index for a text T, of size u, is a data structure used to search for patterns P, of size m, in T, that requires reduced space, i.e. space that depends on the empirical entropy (H_k or H_0) of T, and is, furthermore, able to reproduce any substring of T. In this paper we present a new compressed self-index able to locate the occurrences of P in O((m + occ) log u) time, where occ is the number of occurrences. The fundamental improvement over previous LZ78-based indexes is the reduction of the search time dependency on m from O(m^2) to O(m). To achieve this result we point out the main obstacle to linear time algorithms based on LZ78 data compression and expose and explore the nature of a recurrent structure in LZ-indexes, the suffix tree. We show that our method is very competitive in practice by comparing it against other state-of-the-art compressed indexes.
ISSN: 1386-4564, 1573-7659
DOI: 10.1007/s10791-008-9050-3