Approximate string matching using compressed suffix arrays

Let T be a text of length n and P be a pattern of length m, both strings over a fixed finite alphabet A. The k-difference (k-mismatch, respectively) problem is to find all occurrences of P in T that have edit distance (Hamming distance, respectively) at most k from P. In this paper we investig...
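The two problems defined above can be illustrated with a brute-force sketch. This is not the indexed method of the paper: it scans T directly and is meant only to pin down what counts as an occurrence. The function names `hamming_matches` and `edit_matches` are hypothetical, and reporting start positions for the k-mismatch case and end positions for the k-difference case is an assumption made for this illustration.

```python
def hamming_matches(text, pattern, k):
    """k-mismatch: start positions i where text[i:i+m] differs from
    pattern in at most k character positions (Hamming distance <= k)."""
    m = len(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        mismatches = sum(a != b for a, b in zip(text[i:i + m], pattern))
        if mismatches <= k:
            hits.append(i)
    return hits


def edit_matches(text, pattern, k):
    """k-difference: end positions j (exclusive) such that some substring
    of text ending at j has edit distance <= k from pattern.
    Uses the classic column-wise dynamic programme with a free start."""
    m = len(pattern)
    # col[i] = least edit distance between pattern[:i] and any substring
    # of text ending at the current position; col[0] is always 0 because
    # an occurrence may start anywhere in the text.
    col = list(range(m + 1))
    hits = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(
                prev_diag + (pattern[i - 1] != c),  # match or substitute
                old + 1,                            # extra character in text
                col[i - 1] + 1,                     # character missing from text
            )
            prev_diag = old
        if col[m] <= k:
            hits.append(j)
    return hits
```

Both functions take time proportional to n·m per query, which is the cost an index built over a fixed T is intended to avoid.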

Detailed description

Bibliographic details
Published in:	Theoretical computer science 2006-03, Vol.352 (1), p.240-249
Main authors:	Huynh, Trinh N.D. ; Hon, Wing-Kai ; Lam, Tak-Wah ; Sung, Wing-Kin
Format:	Article
Language:	eng
Subjects:	Algorithmics. Computability. Computer arithmetics ; Applied sciences ; Computer science; control theory; systems ; Data processing. List processing. Character string processing ; Exact sciences and technology ; Memory organisation. Data processing ; Software ; Theoretical computing
Online access:	Full text

container_end_page	249
container_issue	1
container_start_page	240
container_title	Theoretical computer science
container_volume	352
creator	Huynh, Trinh N.D. ; Hon, Wing-Kai ; Lam, Tak-Wah ; Sung, Wing-Kin
description	Let $T$ be a text of length $n$ and $P$ be a pattern of length $m$, both strings over a fixed finite alphabet $A$. The $k$-difference ($k$-mismatch, respectively) problem is to find all occurrences of $P$ in $T$ that have edit distance (Hamming distance, respectively) at most $k$ from $P$. In this paper we investigate a well-studied case in which $T$ is fixed and preprocessed into an indexing data structure so that any pattern query can be answered faster. We give a solution using an $O(n \log n)$-bit indexing data structure with $O(|A|^k m^k \cdot \max(k, \log n) + \mathit{occ})$ query time, where $\mathit{occ}$ is the number of occurrences. The best previous result requires an $O(n \log n)$-bit indexing data structure and gives $O(|A|^k m^{k+2} + \mathit{occ})$ query time. Our solution also allows us to exploit compressed suffix arrays to reduce the indexing space to $O(n)$ bits, while increasing the query time by an $O(\log n)$ factor only.
doi_str_mv	10.1016/j.tcs.2005.11.022
format	Article
backlink	http://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=17562910 (View record in Pascal Francis)
coden	TCSCDI
date	2006-03-07
eissn	1879-2294
els_id	S0304397505008716
oa	free_for_read
publisher	Amsterdam: Elsevier B.V
rights	2005 Elsevier B.V. ; 2006 INIST-CNRS
tpages	10
fulltext	fulltext
identifier	ISSN: 0304-3975
ispartof	Theoretical computer science, 2006-03, Vol.352 (1), p.240-249
issn	0304-3975 1879-2294
language	eng
recordid	cdi_proquest_miscellaneous_28069572
source	Access via ScienceDirect (Elsevier); EZB-FREE-00999 freely available EZB journals
subjects	Algorithmics. Computability. Computer arithmetics ; Applied sciences ; Computer science; control theory; systems ; Data processing. List processing. Character string processing ; Exact sciences and technology ; Memory organisation. Data processing ; Software ; Theoretical computing
title	Approximate string matching using compressed suffix arrays
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-28T02%3A13%3A16IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Approximate%20string%20matching%20using%20compressed%20suffix%20arrays&rft.jtitle=Theoretical%20computer%20science&rft.au=Huynh,%20Trinh%20N.D.&rft.date=2006-03-07&rft.volume=352&rft.issue=1&rft.spage=240&rft.epage=249&rft.pages=240-249&rft.issn=0304-3975&rft.eissn=1879-2294&rft.coden=TCSCDI&rft_id=info:doi/10.1016/j.tcs.2005.11.022&rft_dat=%3Cproquest_cross%3E28069572%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=28069572&rft_id=info:pmid/&rft_els_id=S0304397505008716&rfr_iscdi=true