On the MDL principle for i.i.d. sources with large alphabets

Average case universal compression of independent and identically distributed (i.i.d.) sources is investigated, where the source alphabet is large and may be sublinear in size or even larger than the compressed data sequence length n. In particular, the well-known results, including Rissanen's strongest-sense lower bound, for fixed-size alphabets are extended to the case where the alphabet size k is allowed to grow with n. It is shown that as long as k = o(n), instead of the coding cost in the fixed-size alphabet case of 0.5 log n extra code bits for each of the k-1 unknown probability parameters, the cost is now 0.5 log(n/k) code bits for each unknown parameter. This result is shown to be the lower bound in the minimax and maximin senses, as well as for almost every source in the class. Achievability of this bound is demonstrated with two-part codes based on quantization of the maximum-likelihood (ML) probability parameters, as well as by using the well-known Krichevsky-Trofimov (KT) low-complexity sequential probability estimates. For very large alphabets, k ≫ n, it is shown that an average minimax and maximin bound on the redundancy is essentially (to first order) log(k/n) bits per symbol. This bound is shown to be achievable both with two-part codes and with a sequential modification of the KT estimates. For k = Θ(n), the redundancy is Θ(1) bits per symbol. Finally, sequential codes are designed for coding sequences in which only m < min{k, n} alphabet symbols occur.
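The Krichevsky-Trofimov (KT) sequential probability estimates discussed in the paper admit a short illustration. The sketch below (hypothetical code, not taken from the paper) computes the KT code length of a binary sequence and its redundancy over the empirical entropy, which for a fixed-size alphabet grows like 0.5 log2(n) bits per unknown parameter, matching the per-parameter cost the abstract describes:

```python
import math

def kt_sequential_codelength(seq, k):
    """Code length in bits assigned to `seq` (symbols in {0..k-1}) by the
    Krichevsky-Trofimov sequential estimator:
        P(a | past) = (n_a + 1/2) / (t + k/2),
    where n_a is the count of symbol a and t is the number of symbols seen."""
    counts = [0] * k
    bits = 0.0
    for t, a in enumerate(seq):
        p = (counts[a] + 0.5) / (t + k / 2.0)
        bits -= math.log2(p)  # ideal code length of this symbol
        counts[a] += 1
    return bits

# Balanced binary sequence: empirical entropy is exactly 1 bit/symbol,
# so the redundancy is the excess of the KT code length over n bits.
n, k = 4096, 2
seq = [0] * (n // 2) + [1] * (n // 2)
bits = kt_sequential_codelength(seq, k)
redundancy = bits - n  # roughly ((k-1)/2) * log2(n) = 6 bits here
```

The per-parameter cost 0.5 log2(n) appears because k - 1 = 1 free parameter must effectively be learned from the data; for growing alphabets with k = o(n) the paper shows this cost drops to 0.5 log(n/k) per parameter.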

Detailed description

Bibliographic details
Published in: IEEE transactions on information theory 2006-05, Vol.52 (5), p.1939-1955
Author: Shamir, G.I.
Format: Article
Language: English
Online access: Order full text
container_end_page 1955
container_issue 5
container_start_page 1939
container_title IEEE transactions on information theory
container_volume 52
creator Shamir, G.I.
description Average case universal compression of independent and identically distributed (i.i.d.) sources is investigated, where the source alphabet is large and may be sublinear in size or even larger than the compressed data sequence length n. In particular, the well-known results, including Rissanen's strongest-sense lower bound, for fixed-size alphabets are extended to the case where the alphabet size k is allowed to grow with n. It is shown that as long as k = o(n), instead of the coding cost in the fixed-size alphabet case of 0.5 log n extra code bits for each of the k-1 unknown probability parameters, the cost is now 0.5 log(n/k) code bits for each unknown parameter. This result is shown to be the lower bound in the minimax and maximin senses, as well as for almost every source in the class. Achievability of this bound is demonstrated with two-part codes based on quantization of the maximum-likelihood (ML) probability parameters, as well as by using the well-known Krichevsky-Trofimov (KT) low-complexity sequential probability estimates. For very large alphabets, k ≫ n, it is shown that an average minimax and maximin bound on the redundancy is essentially (to first order) log(k/n) bits per symbol. This bound is shown to be achievable both with two-part codes and with a sequential modification of the KT estimates. For k = Θ(n), the redundancy is Θ(1) bits per symbol. Finally, sequential codes are designed for coding sequences in which only m < min{k, n} alphabet symbols occur.
doi_str_mv 10.1109/TIT.2006.872846
format Article
fulltext fulltext_linktorsrc
identifier ISSN: 0018-9448
ispartof IEEE transactions on information theory, 2006-05, Vol.52 (5), p.1939-1955
issn 0018-9448
1557-9654
language eng
recordid cdi_proquest_miscellaneous_28048187
source IEEE Electronic Library (IEL)
subjects Alphabets
Applied sciences
Cities and towns
Coding
Coding, codes
Communication system control
Costs
Data compression
Entropy
Estimates
Exact sciences and technology
Gas insulated transmission lines
Independent and identically distributed (i.i.d.) sources
Information theory
Information, signal and communications theory
Lower bounds
maximin redundancy
Maximum likelihood estimation
minimax redundancy
Minimax technique
Minimax techniques
minimum description length (MDL)
Quantization
Redundancy
redundancy for most sources
redundancy-capacity theorem
Sampling, quantization
sequential codes
Signal and communications theory
Symbols
Telecommunications and information theory
universal coding
title On the MDL principle for i.i.d. sources with large alphabets
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-20T22%3A30%3A27IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=On%20the%20MDL%20principle%20for%20i.i.d.%20sources%20with%20large%20alphabets&rft.jtitle=IEEE%20transactions%20on%20information%20theory&rft.au=Shamir,%20G.I.&rft.date=2006-05-01&rft.volume=52&rft.issue=5&rft.spage=1939&rft.epage=1955&rft.pages=1939-1955&rft.issn=0018-9448&rft.eissn=1557-9654&rft.coden=IETTAW&rft_id=info:doi/10.1109/TIT.2006.872846&rft_dat=%3Cproquest_RIE%3E28048187%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1365154462&rft_id=info:pmid/&rft_ieee_id=1624633&rfr_iscdi=true