Compressed full-text indexes
Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text l...
Published in: | ACM computing surveys 2007-04, Vol.39 (1), p.2 |
---|---|
Main authors: | NAVARRO, Gonzalo ; MÄKINEN, Veli |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | |
---|---|
container_issue | 1 |
container_start_page | 2 |
container_title | ACM computing surveys |
container_volume | 39 |
creator | NAVARRO, Gonzalo ; MÄKINEN, Veli |
description | Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into self-indexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and produced surprising results in a very short time, which radically changed the status of this area in less than 5 years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this article we present the main concepts underlying (compressed) self-indexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant self-indexes, focusing on how they exploit text compressibility to achieve compact structures that can efficiently solve various search problems. Our aim is to give the background to understand and follow the developments in this area. |
doi_str_mv | 10.1145/1216370.1216372 |
format | Article |
fullrecord | <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_34556892</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>34556892</sourcerecordid><originalsourceid>FETCH-LOGICAL-c409t-f4b4fbcbcb903fcf8ea4dc494d0c1bdffd0bab5cd7666aa052cea126436e55f83</originalsourceid><addsrcrecordid>eNpFkM1Lw0AUxBdRMFbPXjz0orfY97IfSY5SrAoFL3oOm923EEmTuC-F-t-b2oDM4cfAzBxGiFuER0SlV5ihkflk_pidiQS1ztNcKjwXCUgDKUiAS3HF_AUAmUKTiLt1vxsiMZNfhn3bpiMdxmXTeToQX4uLYFumm5kL8bl5_li_ptv3l7f10zZ1CsoxDapWoXaTSpDBhYKs8k6VyoPD2ofgoba1dj43xlgLOnNkMTNKGtI6FHIhHk67Q-y_98RjtWvYUdvajvo9V1JpbYoym4KrU9DFnjlSqIbY7Gz8qRCq4wvV_MLMY-N-nrbsbBui7VzD_7UixwIlyl8Gx1vU</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>34556892</pqid></control><display><type>article</type><title>Compressed full-text indexes</title><source>Access via ACM Digital Library</source><creator>NAVARRO, Gonzalo ; MÄKINEN, Veli</creator><creatorcontrib>NAVARRO, Gonzalo ; MÄKINEN, Veli</creatorcontrib><description>Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into
self-indexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and produced surprising results in a very short time, which radically changed the status of this area in less than 5 years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously.
In this article we present the main concepts underlying (compressed) self-indexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant self-indexes, focusing on how they exploit text compressibility to achieve compact structures that can efficiently solve various search problems. Our aim is to give the background to understand and follow the developments in this area.</description><identifier>ISSN: 0360-0300</identifier><identifier>EISSN: 1557-7341</identifier><identifier>DOI: 10.1145/1216370.1216372</identifier><identifier>CODEN: CMSVAN</identifier><language>eng</language><publisher>New York, NY: Association for Computing Machinery</publisher><subject>Applied sciences ; Artificial intelligence ; Computer science; control theory; systems ; Exact sciences and technology ; Information systems. Data bases ; Memory organisation. Data processing ; Software ; Speech and sound recognition and synthesis. 
Linguistics</subject><ispartof>ACM computing surveys, 2007-04, Vol.39 (1), p.2</ispartof><rights>2007 INIST-CNRS</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c409t-f4b4fbcbcb903fcf8ea4dc494d0c1bdffd0bab5cd7666aa052cea126436e55f83</citedby><cites>FETCH-LOGICAL-c409t-f4b4fbcbcb903fcf8ea4dc494d0c1bdffd0bab5cd7666aa052cea126436e55f83</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925</link.rule.ids><backlink>$$Uhttp://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=18718131$$DView record in Pascal Francis$$Hfree_for_read</backlink></links><search><creatorcontrib>NAVARRO, Gonzalo</creatorcontrib><creatorcontrib>MÄKINEN, Veli</creatorcontrib><title>Compressed full-text indexes</title><title>ACM computing surveys</title><description>Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into
self-indexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and produced surprising results in a very short time, which radically changed the status of this area in less than 5 years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously.
In this article we present the main concepts underlying (compressed) self-indexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant self-indexes, focusing on how they exploit text compressibility to achieve compact structures that can efficiently solve various search problems. Our aim is to give the background to understand and follow the developments in this area.</description><subject>Applied sciences</subject><subject>Artificial intelligence</subject><subject>Computer science; control theory; systems</subject><subject>Exact sciences and technology</subject><subject>Information systems. Data bases</subject><subject>Memory organisation. Data processing</subject><subject>Software</subject><subject>Speech and sound recognition and synthesis. Linguistics</subject><issn>0360-0300</issn><issn>1557-7341</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2007</creationdate><recordtype>article</recordtype><recordid>eNpFkM1Lw0AUxBdRMFbPXjz0orfY97IfSY5SrAoFL3oOm923EEmTuC-F-t-b2oDM4cfAzBxGiFuER0SlV5ihkflk_pidiQS1ztNcKjwXCUgDKUiAS3HF_AUAmUKTiLt1vxsiMZNfhn3bpiMdxmXTeToQX4uLYFumm5kL8bl5_li_ptv3l7f10zZ1CsoxDapWoXaTSpDBhYKs8k6VyoPD2ofgoba1dj43xlgLOnNkMTNKGtI6FHIhHk67Q-y_98RjtWvYUdvajvo9V1JpbYoym4KrU9DFnjlSqIbY7Gz8qRCq4wvV_MLMY-N-nrbsbBui7VzD_7UixwIlyl8Gx1vU</recordid><startdate>20070412</startdate><enddate>20070412</enddate><creator>NAVARRO, Gonzalo</creator><creator>MÄKINEN, Veli</creator><general>Association for Computing Machinery</general><scope>IQODW</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>20070412</creationdate><title>Compressed full-text indexes</title><author>NAVARRO, Gonzalo ; MÄKINEN, 
Veli</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c409t-f4b4fbcbcb903fcf8ea4dc494d0c1bdffd0bab5cd7666aa052cea126436e55f83</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2007</creationdate><topic>Applied sciences</topic><topic>Artificial intelligence</topic><topic>Computer science; control theory; systems</topic><topic>Exact sciences and technology</topic><topic>Information systems. Data bases</topic><topic>Memory organisation. Data processing</topic><topic>Software</topic><topic>Speech and sound recognition and synthesis. Linguistics</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>NAVARRO, Gonzalo</creatorcontrib><creatorcontrib>MÄKINEN, Veli</creatorcontrib><collection>Pascal-Francis</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>ACM computing surveys</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>NAVARRO, Gonzalo</au><au>MÄKINEN, Veli</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Compressed full-text indexes</atitle><jtitle>ACM computing surveys</jtitle><date>2007-04-12</date><risdate>2007</risdate><volume>39</volume><issue>1</issue><spage>2</spage><pages>2-</pages><issn>0360-0300</issn><eissn>1557-7341</eissn><coden>CMSVAN</coden><abstract>Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. 
A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into
self-indexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and produced surprising results in a very short time, which radically changed the status of this area in less than 5 years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously.
In this article we present the main concepts underlying (compressed) self-indexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant self-indexes, focusing on how they exploit text compressibility to achieve compact structures that can efficiently solve various search problems. Our aim is to give the background to understand and follow the developments in this area.</abstract><cop>New York, NY</cop><pub>Association for Computing Machinery</pub><doi>10.1145/1216370.1216372</doi><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 0360-0300 |
ispartof | ACM computing surveys, 2007-04, Vol.39 (1), p.2 |
issn | 0360-0300 ; 1557-7341 |
language | eng |
recordid | cdi_proquest_miscellaneous_34556892 |
source | Access via ACM Digital Library |
subjects | Applied sciences ; Artificial intelligence ; Computer science; control theory; systems ; Exact sciences and technology ; Information systems. Data bases ; Memory organisation. Data processing ; Software ; Speech and sound recognition and synthesis. Linguistics |
title | Compressed full-text indexes |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-23T15%3A00%3A53IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Compressed%20full-text%20indexes&rft.jtitle=ACM%20computing%20surveys&rft.au=NAVARRO,%20Gonzalo&rft.date=2007-04-12&rft.volume=39&rft.issue=1&rft.spage=2&rft.pages=2-&rft.issn=0360-0300&rft.eissn=1557-7341&rft.coden=CMSVAN&rft_id=info:doi/10.1145/1216370.1216372&rft_dat=%3Cproquest_cross%3E34556892%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=34556892&rft_id=info:pmid/&rfr_iscdi=true |