On the Approximation Ratio of Ordered Parsings

Bibliographic Details
Published in: IEEE Transactions on Information Theory, 2021-02, Vol. 67 (2), p. 1008-1026
Authors: Navarro, Gonzalo; Ochoa, Carlos; Prezza, Nicola
Format: Article
Language: English
Abstract: Shannon's entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based compression. A plausible lower bound is $b$, the least number of phrases of a general bidirectional parse of a text, where phrases can be copied from anywhere else in the text. Since computing $b$ is NP-complete, a popular gold standard is $z$, the number of phrases in the Lempel-Ziv parse of the text, which is computed in linear time and yields the least number of phrases when those can be copied only from the left. Almost nothing has been known for decades about the approximation ratio of $z$ with respect to $b$. In this paper we prove that $z = O(b \log(n/b))$, where $n$ is the text length. We also show that the bound is tight as a function of $n$, by exhibiting a text family where $z = \Omega(b \log n)$. Our upper bound is obtained by building a run-length context-free grammar based on a locally consistent parsing of the text. Our lower bound is obtained by relating $b$ with $r$, the number of equal-letter runs in the Burrows-Wheeler transform of the text. We continue by observing that Lempel-Ziv is just one particular case of greedy parses (meaning that it obtains the smallest parse by scanning the text and maximizing the phrase length at each step) and of ordered parses (meaning that phrases are larger than their sources under some order). As a new example of ordered greedy parses, we introduce lexicographical parses, where phrases can only be copied from lexicographically smaller text locations. We prove that the size $v$ ...
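To make the greedy left-to-right parse mentioned in the abstract concrete, the following minimal Python sketch (not from the paper) computes a naive Lempel-Ziv factorization. It assumes the common variant in which each phrase is either a single fresh character or the longest prefix of the unparsed suffix whose copy lies entirely inside the already-parsed prefix; the linear-time construction referenced above relies on suffix-based indexes rather than this quadratic scan, and other LZ variants allow the copy to overlap the phrase.

```python
def lz_parse(text: str):
    """Greedy left-to-right Lempel-Ziv parse (naive quadratic sketch).

    Each phrase is either a single fresh character or the longest prefix
    of the still-unparsed suffix that also occurs entirely inside the
    already-parsed prefix text[:i]. The number of phrases returned plays
    the role of z in the abstract (up to the exact LZ variant used).
    """
    phrases = []
    i, n = 0, len(text)
    while i < n:
        # Grow the candidate phrase while a copy still exists in text[:i].
        length = 0
        while i + length < n and text.find(text[i:i + length + 1], 0, i) >= 0:
            length += 1
        if length == 0:
            phrases.append(text[i])          # fresh character, no earlier copy
            i += 1
        else:
            phrases.append(text[i:i + length])
            i += length
    return phrases


if __name__ == "__main__":
    # Illustrative string; the greedy parse here has 10 phrases:
    # ['a', 'l', 'a', 'b', 'a', 'r', 'ala', 'labar', 'd', 'a']
    print(lz_parse("alabaralalabarda"))
```

Maximizing the phrase length at each step is exactly the "greedy" property the paper generalizes; lexicographical parses replace the constraint "the source starts to the left" with "the source starts at a lexicographically smaller suffix".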
ISSN: 0018-9448 (print), 1557-9654 (electronic)
DOI: 10.1109/TIT.2020.3042746