On the Approximation Ratio of Ordered Parsings

Shannon's entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based compression. A plausible lower bound is $b$, the least number of phrases of a general bidirectional parse of a text, where phrases can be copied from...

Bibliographic details
Published in: IEEE transactions on information theory 2021-02, Vol.67 (2), p.1008-1026
Main authors: Navarro, Gonzalo; Ochoa, Carlos; Prezza, Nicola
Format: Article
Language: eng
container_end_page 1026
container_issue 2
container_start_page 1008
container_title IEEE transactions on information theory
container_volume 67
creator Navarro, Gonzalo
Ochoa, Carlos
Prezza, Nicola
description Shannon's entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based compression. A plausible lower bound is $b$, the least number of phrases of a general bidirectional parse of a text, where phrases can be copied from anywhere else in the text. Since computing $b$ is NP-complete, a popular gold standard is $z$, the number of phrases in the Lempel-Ziv parse of the text, which is computed in linear time and yields the least number of phrases when those can be copied only from the left. Almost nothing has been known for decades about the approximation ratio of $z$ with respect to $b$. In this paper we prove that $z = O(b\log(n/b))$, where $n$ is the text length. We also show that the bound is tight as a function of $n$, by exhibiting a text family where $z = \Omega(b\log n)$. Our upper bound is obtained by building a run-length context-free grammar based on a locally consistent parsing of the text. Our lower bound is obtained by relating $b$ with $r$, the number of equal-letter runs in the Burrows-Wheeler transform of the text. We continue by observing that Lempel-Ziv is just one particular case of greedy parses (meaning that it obtains the smallest parse by scanning the text and maximizing the phrase length at each step), and of ordered parses (meaning that phrases are larger than their sources under some order). As a new example of ordered greedy parses, we introduce lexicographical parses, where phrases can only be copied from lexicographically smaller text locations. We prove that the size $v$
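The two measures $z$ and $r$ named in the description can be illustrated with a small sketch. It is a naive illustration, not the linear-time method the abstract refers to and not the authors' code. The function names `lz_parse` and `bwt_runs`, the choice of a NUL character as end-of-text sentinel, and the convention that a phrase with no earlier occurrence is a single explicit character are all assumptions made for this example.

```python
def lz_parse(text: str) -> list[str]:
    """Greedy left-to-right Lempel-Ziv parse (naive, polynomial time).

    Each phrase is the longest prefix of the remaining text that also
    starts at some earlier position (the source may overlap the phrase).
    Assumption: if no such prefix exists, the phrase is one explicit character.
    """
    phrases = []
    n = len(text)
    i = 0
    while i < n:
        best = 0
        # Try every earlier starting position j < i as a candidate source.
        for j in range(i):
            k = 0
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            best = max(best, k)
        length = max(best, 1)  # new character: emit it explicitly
        phrases.append(text[i:i + length])
        i += length
    return phrases


def bwt_runs(text: str, sentinel: str = "\x00") -> int:
    """Number of equal-letter runs in the Burrows-Wheeler transform.

    Assumption: a sentinel smaller than every text character is appended,
    and the transform is built by sorting all rotations (quadratic space).
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    last_column = "".join(rotation[-1] for rotation in rotations)
    # A run starts at position 0 and wherever the character changes.
    return sum(
        1
        for i, c in enumerate(last_column)
        if i == 0 or c != last_column[i - 1]
    )


if __name__ == "__main__":
    sample = "abababab"
    phrases = lz_parse(sample)
    print(phrases, "z =", len(phrases))
    print("r =", bwt_runs("banana"))
```

On the sample `abababab` the third phrase copies from a source that overlaps it, which is what lets a highly repetitive text be parsed into very few phrases.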
doi_str_mv 10.1109/TIT.2020.3042746
format Article
fullrecord <record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_ieee_primary_9281349</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>9281349</ieee_id><sourcerecordid>2480872051</sourcerecordid><originalsourceid>FETCH-LOGICAL-c357t-98cb9165cb4c9ef5f8befb02af15ba0486683075b659f76a32e8bf51747a6bac3</originalsourceid><addsrcrecordid>eNo9kM1LAzEQxYMouFbvgpcFz7tOvpNjKVoLhYrUc0i2iW7R3ZpsQf97s2zx9Bh4b2beD6FbDDXGoB-2q21NgEBNgRHJxBkqMOey0oKzc1QAYFVpxtQlukppn0fGMSlQvenK4cOX88Mh9j_tlx3avitfRyn7UG7izke_K19sTG33nq7RRbCfyd-cdIbenh63i-dqvVmuFvN11VAuh0qrxmkseONYo33gQTkfHBAbMHcWmBJCUZDcCa6DFJYSr1zgWDJphbMNnaH7aW_-6vvo02D2_TF2-aQhTIGSBDjOLphcTexTij6YQ8wV4q_BYEYqJlMxIxVzopIjd1Ok9d7_2zVRmDJN_wCPZlxe</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2480872051</pqid></control><display><type>article</type><title>On the Approximation Ratio of Ordered Parsings</title><source>IEEE Electronic Library (IEL)</source><creator>Navarro, Gonzalo ; Ochoa, Carlos ; Prezza, Nicola</creator><creatorcontrib>Navarro, Gonzalo ; Ochoa, Carlos ; Prezza, Nicola</creatorcontrib><description><![CDATA[Shannon's entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based compression. A plausible lower bound is <inline-formula> <tex-math notation="LaTeX">\boldsymbol {b} </tex-math></inline-formula>, the least number of phrases of a general bidirectional parse of a text, where phrases can be copied from anywhere else in the text. Since computing <inline-formula> <tex-math notation="LaTeX">\boldsymbol {b} </tex-math></inline-formula> is NP-complete, a popular gold standard is <inline-formula> <tex-math notation="LaTeX">\boldsymbol {z} </tex-math></inline-formula>, the number of phrases in the Lempel-Ziv parse of the text, which is computed in linear time and yields the least number of phrases when those can be copied only from the left. 
Almost nothing has been known for decades about the approximation ratio of <inline-formula> <tex-math notation="LaTeX">\boldsymbol {z} </tex-math></inline-formula> with respect to <inline-formula> <tex-math notation="LaTeX">\boldsymbol {b} </tex-math></inline-formula>. In this paper we prove that <inline-formula> <tex-math notation="LaTeX">z=O(b\log (n/b)) </tex-math></inline-formula>, where <inline-formula> <tex-math notation="LaTeX">n </tex-math></inline-formula> is the text length. We also show that the bound is tight as a function of <inline-formula> <tex-math notation="LaTeX">n </tex-math></inline-formula>, by exhibiting a text family where <inline-formula> <tex-math notation="LaTeX">z = \Omega (b\log n) </tex-math></inline-formula>. Our upper bound is obtained by building a run-length context-free grammar based on a locally consistent parsing of the text. Our lower bound is obtained by relating <inline-formula> <tex-math notation="LaTeX">\boldsymbol {b} </tex-math></inline-formula> with <inline-formula> <tex-math notation="LaTeX">r </tex-math></inline-formula>, the number of equal-letter runs in the Burrows-Wheeler transform of the text. We continue by observing that Lempel-Ziv is just one particular case of greedy parses-meaning that it obtains the smallest parse by scanning the text and maximizing the phrase length at each step-, and of ordered parses-meaning that phrases are larger than their sources under some order. As a new example of ordered greedy parses, we introduce lexicographical parses, where phrases can only be copied from lexicographically smaller text locations. 
We prove that the size <inline-formula> <tex-math notation="LaTeX">v </tex-math></inline-formula> of the optimal lexicographical parse is also obtained greedily in <inline-formula> <tex-math notation="LaTeX">O(n) </tex-math></inline-formula> time, that <inline-formula> <tex-math notation="LaTeX">v=O(b\log (n/b)) </tex-math></inline-formula>, and that there exists a text family where <inline-formula> <tex-math notation="LaTeX">v = \Omega (b\log n) </tex-math></inline-formula>. Interestingly, we also show that <inline-formula> <tex-math notation="LaTeX">v = O(r) </tex-math></inline-formula> because <inline-formula> <tex-math notation="LaTeX">r </tex-math></inline-formula> also induces a lexicographical parse, whereas <inline-formula> <tex-math notation="LaTeX">z = \Omega (r\log n) </tex-math></inline-formula> holds on some text families. We obtain some results on parsing complexity and size that hold on some general classes of greedy ordered parses. In our way, we also prove other relevant bounds between compressibility measures, especially with those related to smallest grammars of various types generating (only) the text.]]></description><identifier>ISSN: 0018-9448</identifier><identifier>EISSN: 1557-9654</identifier><identifier>DOI: 10.1109/TIT.2020.3042746</identifier><identifier>CODEN: IETTAW</identifier><language>eng</language><publisher>New York: IEEE</publisher><subject>Approximation ; Burrows-Wheeler transform ; collage systems ; Compressibility ; context-free grammars ; Entropy ; Entropy (Information theory) ; Genomics ; Grammar ; Grammars ; greedy parsing ; Internet ; Lempel-Ziv complexity ; lexicographic parsing ; Lower bounds ; Mathematical analysis ; optimal bidirectional parsing ; Optimization ; ordered parsing ; Probabilistic logic ; repetitive sequences ; run-length compressed Burrows-Wheeler transform ; Transforms ; Upper bound ; Upper bounds</subject><ispartof>IEEE transactions on information theory, 2021-02, Vol.67 (2), 
p.1008-1026</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2021</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c357t-98cb9165cb4c9ef5f8befb02af15ba0486683075b659f76a32e8bf51747a6bac3</citedby><cites>FETCH-LOGICAL-c357t-98cb9165cb4c9ef5f8befb02af15ba0486683075b659f76a32e8bf51747a6bac3</cites><orcidid>0000-0002-1366-3028 ; 0000-0002-2286-741X ; 0000-0003-3553-4953</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/9281349$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,780,784,796,27924,27925,54758</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/9281349$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Navarro, Gonzalo</creatorcontrib><creatorcontrib>Ochoa, Carlos</creatorcontrib><creatorcontrib>Prezza, Nicola</creatorcontrib><title>On the Approximation Ratio of Ordered Parsings</title><title>IEEE transactions on information theory</title><addtitle>TIT</addtitle><description><![CDATA[Shannon's entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based compression. A plausible lower bound is <inline-formula> <tex-math notation="LaTeX">\boldsymbol {b} </tex-math></inline-formula>, the least number of phrases of a general bidirectional parse of a text, where phrases can be copied from anywhere else in the text. 
Since computing <inline-formula> <tex-math notation="LaTeX">\boldsymbol {b} </tex-math></inline-formula> is NP-complete, a popular gold standard is <inline-formula> <tex-math notation="LaTeX">\boldsymbol {z} </tex-math></inline-formula>, the number of phrases in the Lempel-Ziv parse of the text, which is computed in linear time and yields the least number of phrases when those can be copied only from the left. Almost nothing has been known for decades about the approximation ratio of <inline-formula> <tex-math notation="LaTeX">\boldsymbol {z} </tex-math></inline-formula> with respect to <inline-formula> <tex-math notation="LaTeX">\boldsymbol {b} </tex-math></inline-formula>. In this paper we prove that <inline-formula> <tex-math notation="LaTeX">z=O(b\log (n/b)) </tex-math></inline-formula>, where <inline-formula> <tex-math notation="LaTeX">n </tex-math></inline-formula> is the text length. We also show that the bound is tight as a function of <inline-formula> <tex-math notation="LaTeX">n </tex-math></inline-formula>, by exhibiting a text family where <inline-formula> <tex-math notation="LaTeX">z = \Omega (b\log n) </tex-math></inline-formula>. Our upper bound is obtained by building a run-length context-free grammar based on a locally consistent parsing of the text. Our lower bound is obtained by relating <inline-formula> <tex-math notation="LaTeX">\boldsymbol {b} </tex-math></inline-formula> with <inline-formula> <tex-math notation="LaTeX">r </tex-math></inline-formula>, the number of equal-letter runs in the Burrows-Wheeler transform of the text. We continue by observing that Lempel-Ziv is just one particular case of greedy parses-meaning that it obtains the smallest parse by scanning the text and maximizing the phrase length at each step-, and of ordered parses-meaning that phrases are larger than their sources under some order. 
As a new example of ordered greedy parses, we introduce lexicographical parses, where phrases can only be copied from lexicographically smaller text locations. We prove that the size <inline-formula> <tex-math notation="LaTeX">v </tex-math></inline-formula> of the optimal lexicographical parse is also obtained greedily in <inline-formula> <tex-math notation="LaTeX">O(n) </tex-math></inline-formula> time, that <inline-formula> <tex-math notation="LaTeX">v=O(b\log (n/b)) </tex-math></inline-formula>, and that there exists a text family where <inline-formula> <tex-math notation="LaTeX">v = \Omega (b\log n) </tex-math></inline-formula>. Interestingly, we also show that <inline-formula> <tex-math notation="LaTeX">v = O(r) </tex-math></inline-formula> because <inline-formula> <tex-math notation="LaTeX">r </tex-math></inline-formula> also induces a lexicographical parse, whereas <inline-formula> <tex-math notation="LaTeX">z = \Omega (r\log n) </tex-math></inline-formula> holds on some text families. We obtain some results on parsing complexity and size that hold on some general classes of greedy ordered parses. 
In our way, we also prove other relevant bounds between compressibility measures, especially with those related to smallest grammars of various types generating (only) the text.]]></description><subject>Approximation</subject><subject>Burrows-Wheeler transform</subject><subject>collage systems</subject><subject>Compressibility</subject><subject>context-free grammars</subject><subject>Entropy</subject><subject>Entropy (Information theory)</subject><subject>Genomics</subject><subject>Grammar</subject><subject>Grammars</subject><subject>greedy parsing</subject><subject>Internet</subject><subject>Lempel-Ziv complexity</subject><subject>lexicographic parsing</subject><subject>Lower bounds</subject><subject>Mathematical analysis</subject><subject>optimal bidirectional parsing</subject><subject>Optimization</subject><subject>ordered parsing</subject><subject>Probabilistic logic</subject><subject>repetitive sequences</subject><subject>run-length compressed Burrows-Wheeler transform</subject><subject>Transforms</subject><subject>Upper bound</subject><subject>Upper bounds</subject><issn>0018-9448</issn><issn>1557-9654</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNo9kM1LAzEQxYMouFbvgpcFz7tOvpNjKVoLhYrUc0i2iW7R3ZpsQf97s2zx9Bh4b2beD6FbDDXGoB-2q21NgEBNgRHJxBkqMOey0oKzc1QAYFVpxtQlukppn0fGMSlQvenK4cOX88Mh9j_tlx3avitfRyn7UG7izke_K19sTG33nq7RRbCfyd-cdIbenh63i-dqvVmuFvN11VAuh0qrxmkseONYo33gQTkfHBAbMHcWmBJCUZDcCa6DFJYSr1zgWDJphbMNnaH7aW_-6vvo02D2_TF2-aQhTIGSBDjOLphcTexTij6YQ8wV4q_BYEYqJlMxIxVzopIjd1Ok9d7_2zVRmDJN_wCPZlxe</recordid><startdate>20210201</startdate><enddate>20210201</enddate><creator>Navarro, Gonzalo</creator><creator>Ochoa, Carlos</creator><creator>Prezza, Nicola</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><orcidid>https://orcid.org/0000-0002-1366-3028</orcidid><orcidid>https://orcid.org/0000-0002-2286-741X</orcidid><orcidid>https://orcid.org/0000-0003-3553-4953</orcidid></search><sort><creationdate>20210201</creationdate><title>On the Approximation Ratio of Ordered Parsings</title><author>Navarro, Gonzalo ; Ochoa, Carlos ; Prezza, Nicola</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c357t-98cb9165cb4c9ef5f8befb02af15ba0486683075b659f76a32e8bf51747a6bac3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Approximation</topic><topic>Burrows-Wheeler transform</topic><topic>collage systems</topic><topic>Compressibility</topic><topic>context-free grammars</topic><topic>Entropy</topic><topic>Entropy (Information theory)</topic><topic>Genomics</topic><topic>Grammar</topic><topic>Grammars</topic><topic>greedy parsing</topic><topic>Internet</topic><topic>Lempel-Ziv complexity</topic><topic>lexicographic parsing</topic><topic>Lower bounds</topic><topic>Mathematical analysis</topic><topic>optimal bidirectional parsing</topic><topic>Optimization</topic><topic>ordered parsing</topic><topic>Probabilistic logic</topic><topic>repetitive sequences</topic><topic>run-length compressed Burrows-Wheeler transform</topic><topic>Transforms</topic><topic>Upper bound</topic><topic>Upper bounds</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Navarro, Gonzalo</creatorcontrib><creatorcontrib>Ochoa, Carlos</creatorcontrib><creatorcontrib>Prezza, Nicola</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 
1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics &amp; Communications Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>IEEE transactions on information theory</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Navarro, Gonzalo</au><au>Ochoa, Carlos</au><au>Prezza, Nicola</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>On the Approximation Ratio of Ordered Parsings</atitle><jtitle>IEEE transactions on information theory</jtitle><stitle>TIT</stitle><date>2021-02-01</date><risdate>2021</risdate><volume>67</volume><issue>2</issue><spage>1008</spage><epage>1026</epage><pages>1008-1026</pages><issn>0018-9448</issn><eissn>1557-9654</eissn><coden>IETTAW</coden><abstract><![CDATA[Shannon's entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based compression. A plausible lower bound is <inline-formula> <tex-math notation="LaTeX">\boldsymbol {b} </tex-math></inline-formula>, the least number of phrases of a general bidirectional parse of a text, where phrases can be copied from anywhere else in the text. 
Since computing <inline-formula> <tex-math notation="LaTeX">\boldsymbol {b} </tex-math></inline-formula> is NP-complete, a popular gold standard is <inline-formula> <tex-math notation="LaTeX">\boldsymbol {z} </tex-math></inline-formula>, the number of phrases in the Lempel-Ziv parse of the text, which is computed in linear time and yields the least number of phrases when those can be copied only from the left. Almost nothing has been known for decades about the approximation ratio of <inline-formula> <tex-math notation="LaTeX">\boldsymbol {z} </tex-math></inline-formula> with respect to <inline-formula> <tex-math notation="LaTeX">\boldsymbol {b} </tex-math></inline-formula>. In this paper we prove that <inline-formula> <tex-math notation="LaTeX">z=O(b\log (n/b)) </tex-math></inline-formula>, where <inline-formula> <tex-math notation="LaTeX">n </tex-math></inline-formula> is the text length. We also show that the bound is tight as a function of <inline-formula> <tex-math notation="LaTeX">n </tex-math></inline-formula>, by exhibiting a text family where <inline-formula> <tex-math notation="LaTeX">z = \Omega (b\log n) </tex-math></inline-formula>. Our upper bound is obtained by building a run-length context-free grammar based on a locally consistent parsing of the text. Our lower bound is obtained by relating <inline-formula> <tex-math notation="LaTeX">\boldsymbol {b} </tex-math></inline-formula> with <inline-formula> <tex-math notation="LaTeX">r </tex-math></inline-formula>, the number of equal-letter runs in the Burrows-Wheeler transform of the text. We continue by observing that Lempel-Ziv is just one particular case of greedy parses-meaning that it obtains the smallest parse by scanning the text and maximizing the phrase length at each step-, and of ordered parses-meaning that phrases are larger than their sources under some order. 
As a new example of ordered greedy parses, we introduce lexicographical parses, where phrases can only be copied from lexicographically smaller text locations. We prove that the size <inline-formula> <tex-math notation="LaTeX">v </tex-math></inline-formula> of the optimal lexicographical parse is also obtained greedily in <inline-formula> <tex-math notation="LaTeX">O(n) </tex-math></inline-formula> time, that <inline-formula> <tex-math notation="LaTeX">v=O(b\log (n/b)) </tex-math></inline-formula>, and that there exists a text family where <inline-formula> <tex-math notation="LaTeX">v = \Omega (b\log n) </tex-math></inline-formula>. Interestingly, we also show that <inline-formula> <tex-math notation="LaTeX">v = O(r) </tex-math></inline-formula> because <inline-formula> <tex-math notation="LaTeX">r </tex-math></inline-formula> also induces a lexicographical parse, whereas <inline-formula> <tex-math notation="LaTeX">z = \Omega (r\log n) </tex-math></inline-formula> holds on some text families. We obtain some results on parsing complexity and size that hold on some general classes of greedy ordered parses. In our way, we also prove other relevant bounds between compressibility measures, especially with those related to smallest grammars of various types generating (only) the text.]]></abstract><cop>New York</cop><pub>IEEE</pub><doi>10.1109/TIT.2020.3042746</doi><tpages>19</tpages><orcidid>https://orcid.org/0000-0002-1366-3028</orcidid><orcidid>https://orcid.org/0000-0002-2286-741X</orcidid><orcidid>https://orcid.org/0000-0003-3553-4953</orcidid></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 0018-9448
ispartof IEEE transactions on information theory, 2021-02, Vol.67 (2), p.1008-1026
issn 0018-9448
1557-9654
language eng
recordid cdi_ieee_primary_9281349
source IEEE Electronic Library (IEL)
subjects Approximation
Burrows-Wheeler transform
collage systems
Compressibility
context-free grammars
Entropy
Entropy (Information theory)
Genomics
Grammar
Grammars
greedy parsing
Internet
Lempel-Ziv complexity
lexicographic parsing
Lower bounds
Mathematical analysis
optimal bidirectional parsing
Optimization
ordered parsing
Probabilistic logic
repetitive sequences
run-length compressed Burrows-Wheeler transform
Transforms
Upper bound
Upper bounds
title On the Approximation Ratio of Ordered Parsings
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-23T14%3A51%3A24IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=On%20the%20Approximation%20Ratio%20of%20Ordered%20Parsings&rft.jtitle=IEEE%20transactions%20on%20information%20theory&rft.au=Navarro,%20Gonzalo&rft.date=2021-02-01&rft.volume=67&rft.issue=2&rft.spage=1008&rft.epage=1026&rft.pages=1008-1026&rft.issn=0018-9448&rft.eissn=1557-9654&rft.coden=IETTAW&rft_id=info:doi/10.1109/TIT.2020.3042746&rft_dat=%3Cproquest_RIE%3E2480872051%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2480872051&rft_id=info:pmid/&rft_ieee_id=9281349&rfr_iscdi=true