On the MDL principle for i.i.d. sources with large alphabets

Average case universal compression of independent and identically distributed (i.i.d.) sources is investigated, where the source alphabet is large and may be sublinear in size or even larger than the compressed data sequence length n. In particular, the well-known results, including Rissanen's strongest-sense lower bound, for fixed-size alphabets are extended to the case where the alphabet size k is allowed to grow with n. It is shown that as long as k = o(n), instead of the coding cost in the fixed-size alphabet case of 0.5 log n extra code bits for each of the k-1 unknown probability parameters, the cost is now 0.5 log(n/k) code bits for each unknown parameter. This result is shown to be the lower bound in the minimax and maximin senses, as well as for almost every source in the class. Achievability of this bound is demonstrated with two-part codes based on quantization of the maximum-likelihood (ML) probability parameters, as well as by using the well-known Krichevsky-Trofimov (KT) low-complexity sequential probability estimates. For very large alphabets, k ≫ n, it is shown that an average minimax and maximin bound on the redundancy is essentially (to first order) log(k/n) bits per symbol. This bound is shown to be achievable both with two-part codes and with a sequential modification of the KT estimates. For k = Θ(n), the redundancy is Θ(1) bits per symbol. Finally, sequential codes are designed for coding sequences in which only m < min{k, n} alphabet symbols occur.
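The Krichevsky-Trofimov (KT) sequential probability estimates discussed in the paper admit a short illustration. The sketch below (hypothetical code, not taken from the paper) computes the KT code length of a binary sequence and its redundancy over the empirical entropy, which for a fixed-size alphabet grows like 0.5 log2(n) bits per unknown parameter, matching the per-parameter cost the abstract describes:

```python
import math

def kt_sequential_codelength(seq, k):
    """Code length in bits assigned to `seq` (symbols in {0..k-1}) by the
    Krichevsky-Trofimov sequential estimator:
        P(a | past) = (n_a + 1/2) / (t + k/2),
    where n_a is the count of symbol a and t is the number of symbols seen."""
    counts = [0] * k
    bits = 0.0
    for t, a in enumerate(seq):
        p = (counts[a] + 0.5) / (t + k / 2.0)
        bits -= math.log2(p)  # ideal code length of this symbol
        counts[a] += 1
    return bits

# Balanced binary sequence: empirical entropy is exactly 1 bit/symbol,
# so the redundancy is the excess of the KT code length over n bits.
n, k = 4096, 2
seq = [0] * (n // 2) + [1] * (n // 2)
bits = kt_sequential_codelength(seq, k)
redundancy = bits - n  # roughly ((k-1)/2) * log2(n) = 6 bits here
```

The per-parameter cost 0.5 log2(n) appears because k - 1 = 1 free parameter must effectively be learned from the data; for growing alphabets with k = o(n) the paper shows this cost drops to 0.5 log(n/k) per parameter.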

Detailed description

Bibliographic details
Published in: IEEE transactions on information theory 2006-05, Vol.52 (5), p.1939-1955
Author: Shamir, G.I.
Format: Article
Language: English
Online access: Order full text
container_end_page 1955
container_issue 5
container_start_page 1939
container_title IEEE transactions on information theory
container_volume 52
creator Shamir, G.I.
description Average case universal compression of independent and identically distributed (i.i.d.) sources is investigated, where the source alphabet is large and may be sublinear in size or even larger than the compressed data sequence length n. In particular, the well-known results, including Rissanen's strongest-sense lower bound, for fixed-size alphabets are extended to the case where the alphabet size k is allowed to grow with n. It is shown that as long as k = o(n), instead of the coding cost in the fixed-size alphabet case of 0.5 log n extra code bits for each of the k-1 unknown probability parameters, the cost is now 0.5 log(n/k) code bits for each unknown parameter. This result is shown to be the lower bound in the minimax and maximin senses, as well as for almost every source in the class. Achievability of this bound is demonstrated with two-part codes based on quantization of the maximum-likelihood (ML) probability parameters, as well as by using the well-known Krichevsky-Trofimov (KT) low-complexity sequential probability estimates. For very large alphabets, k ≫ n, it is shown that an average minimax and maximin bound on the redundancy is essentially (to first order) log(k/n) bits per symbol. This bound is shown to be achievable both with two-part codes and with a sequential modification of the KT estimates. For k = Θ(n), the redundancy is Θ(1) bits per symbol. Finally, sequential codes are designed for coding sequences in which only m < min{k, n} alphabet symbols occur.
doi_str_mv 10.1109/TIT.2006.872846
format Article
fulltext fulltext_linktorsrc
identifier ISSN: 0018-9448
ispartof IEEE transactions on information theory, 2006-05, Vol.52 (5), p.1939-1955
issn 0018-9448
1557-9654
language eng
recordid cdi_proquest_miscellaneous_28048187
source IEEE Electronic Library (IEL)
subjects Alphabets
Applied sciences
Cities and towns
Coding
Coding, codes
Communication system control
Costs
Data compression
Entropy
Estimates
Exact sciences and technology
Gas insulated transmission lines
Independent and identically distributed (i.i.d.) sources
Information theory
Information, signal and communications theory
Lower bounds
maximin redundancy
Maximum likelihood estimation
minimax redundancy
Minimax technique
Minimax techniques
minimum description length (MDL)
Quantization
Redundancy
redundancy for most sources
redundancy-capacity theorem
Sampling, quantization
sequential codes
Signal and communications theory
Symbols
Telecommunications and information theory
universal coding
title On the MDL principle for i.i.d. sources with large alphabets
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-20T22%3A30%3A27IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=On%20the%20MDL%20principle%20for%20i.i.d.%20sources%20with%20large%20alphabets&rft.jtitle=IEEE%20transactions%20on%20information%20theory&rft.au=Shamir,%20G.I.&rft.date=2006-05-01&rft.volume=52&rft.issue=5&rft.spage=1939&rft.epage=1955&rft.pages=1939-1955&rft.issn=0018-9448&rft.eissn=1557-9654&rft.coden=IETTAW&rft_id=info:doi/10.1109/TIT.2006.872846&rft_dat=%3Cproquest_RIE%3E28048187%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1365154462&rft_id=info:pmid/&rft_ieee_id=1624633&rfr_iscdi=true