A compressed self-index using a Ziv–Lempel dictionary


Bibliographic details
Published in:	Information retrieval (Boston) 2008-08, Vol.11 (4), p.359-388
Main authors:	Russo, Luís M. S., Oliveira, Arlindo L.
Format:	Article
Language:	English
Subjects:	Algorithms; Computer Science; Data compression; Data Mining and Knowledge Discovery; Data structures; Data Structures and Information Theory; Dictionaries; Entropy; Exact sciences and technology; Information and communication sciences; Information processing and retrieval; Information retrieval. Man machine relationship; Information science. Documentation; Information Storage and Retrieval; Natural Language Processing (NLP); Pattern Recognition; Research process. Evaluation; Sciences and techniques of general use; Studies; Suffix trees; Text editing; Time dependence; Trees

Description
Abstract:	A compressed full-text self-index for a text T, of size u, is a data structure used to search for patterns P, of size m, in T, that requires reduced space, i.e. space that depends on the empirical entropy (H_k or H_0) of T, and is, furthermore, able to reproduce any substring of T. In this paper we present a new compressed self-index able to locate the occurrences of P in O((m + occ) log u) time, where occ is the number of occurrences. The fundamental improvement over previous LZ78-based indexes is the reduction of the search time dependency on m from O(m^2) to O(m). To achieve this result we point out the main obstacle to linear time algorithms based on LZ78 data compression and expose and explore the nature of a recurrent structure in LZ-indexes, the suffix tree. We show that our method is very competitive in practice by comparing it against other state-of-the-art compressed indexes.
ISSN:	1386-4564, 1573-7659
DOI:	10.1007/s10791-008-9050-3
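
The abstract refers to the LZ78 (Ziv–Lempel) dictionary that LZ-indexes are built on. As a rough illustration only, the following minimal sketch shows a textbook LZ78 parse of a text into phrases, each encoded as a pair (id of an earlier phrase, one new character). It is not the index described in the paper; the function names `lz78_parse` and `lz78_decode` and the dictionary-based trie layout are assumptions made for this example.

```python
def lz78_parse(text):
    """Split text into LZ78 phrases.

    Returns a list of (prefix_id, char) pairs. Phrase i (1-based) equals
    phrase prefix_id followed by char; prefix_id 0 is the empty phrase.
    Illustrative helper, not taken from the paper.
    """
    trie = {0: {}}   # node id -> {char: child node id}; node 0 is the root
    phrases = []
    node = 0
    for ch in text:
        child = trie[node].get(ch)
        if child is not None:
            # The current phrase can still be extended inside the dictionary.
            node = child
        else:
            # Longest known prefix ends here: add a new phrase to the dictionary.
            new_id = len(trie)
            trie[node][ch] = new_id
            trie[new_id] = {}
            phrases.append((node, ch))
            node = 0
    if node != 0:
        # The text ended inside a known phrase; emit it with no new character.
        phrases.append((node, ""))
    return phrases


def lz78_decode(phrases):
    """Rebuild the original text from the (prefix_id, char) pairs."""
    decoded = [""]   # decoded[0] is the empty phrase
    for prefix_id, ch in phrases:
        decoded.append(decoded[prefix_id] + ch)
    return "".join(decoded[1:])


if __name__ == "__main__":
    pairs = lz78_parse("abababab")
    print(pairs)
    print(lz78_decode(pairs))
```

Every phrase extends an earlier phrase by one character, so the set of phrases forms a trie. A self-index in the sense of the abstract must additionally support pattern search and substring extraction over a compressed representation of this structure, which the sketch does not attempt.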