Approximate string matching using compressed suffix arrays

Bibliographic details
Published in: Theoretical Computer Science, 2006-03, Vol. 352 (1), p. 240-249
Main authors: Huynh, Trinh N.D.; Hon, Wing-Kai; Lam, Tak-Wah; Sung, Wing-Kin
Format: Article
Language: English
Description
Abstract: Let $T$ be a text of length $n$ and $P$ be a pattern of length $m$, both strings over a fixed finite alphabet $A$. The $k$-difference ($k$-mismatch, respectively) problem is to find all occurrences of $P$ in $T$ that have edit distance (Hamming distance, respectively) at most $k$ from $P$. In this paper we investigate a well-studied case in which $T$ is fixed and preprocessed into an indexing data structure so that any pattern query can be answered faster. We give a solution using an $O(n \log n)$-bit indexing data structure with $O(|A|^k m^k \cdot \max(k, \log n) + \mathit{occ})$ query time, where $\mathit{occ}$ is the number of occurrences. The best previous result requires an $O(n \log n)$-bit indexing data structure and gives $O(|A|^k m^{k+2} + \mathit{occ})$ query time. Our solution also allows us to exploit compressed suffix arrays to reduce the indexing space to $O(n)$ bits, while increasing the query time by only an $O(\log n)$ factor.
ISSN: 0304-3975, 1879-2294
DOI: 10.1016/j.tcs.2005.11.022
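
Illustrative sketch (not from the article): the $k$-mismatch problem defined in the abstract can be tried out on a small indexed text with the Python code below. It is a simple baseline, not the authors' algorithm. It uses a plain, uncompressed suffix array built by naive sorting, handles Hamming distance only, and makes no attempt to reach the query time stated above. The function names `build_suffix_array`, `narrow`, and `k_mismatch` are hypothetical. The search spells out every string within Hamming distance $k$ of $P$ one character at a time and abandons a branch as soon as no suffix of $T$ starts with it.

```python
def build_suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix itself.
    # Acceptable for small demonstrations only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def narrow(text, sa, lo, hi, depth, ch):
    """Shrink the suffix-array range [lo, hi), whose suffixes all share the
    same first `depth` characters, to the suffixes whose next character is `ch`."""
    def char_at(i):
        pos = sa[i] + depth
        # A suffix that ends here sorts before every real character.
        return text[pos] if pos < len(text) else ""

    # First index whose character at `depth` is >= ch.
    a, b = lo, hi
    while a < b:
        mid = (a + b) // 2
        if char_at(mid) < ch:
            a = mid + 1
        else:
            b = mid
    start = a

    # First index whose character at `depth` is > ch.
    b = hi
    while a < b:
        mid = (a + b) // 2
        if char_at(mid) <= ch:
            a = mid + 1
        else:
            b = mid
    return start, a


def k_mismatch(text, sa, pattern, k, alphabet):
    """Return the sorted start positions i such that text[i:i+len(pattern)]
    is within Hamming distance k of pattern."""
    hits = []

    def extend(lo, hi, depth, errors):
        if lo >= hi:
            return  # no suffix of the text starts with the current string
        if depth == len(pattern):
            hits.extend(sa[lo:hi])
            return
        for ch in alphabet:
            cost = 0 if ch == pattern[depth] else 1
            if errors + cost <= k:
                new_lo, new_hi = narrow(text, sa, lo, hi, depth, ch)
                extend(new_lo, new_hi, depth + 1, errors + cost)

    extend(0, len(sa), 0, 0)
    return sorted(hits)


text = "banana"
sa = build_suffix_array(text)
print(k_mismatch(text, sa, "nan", 1, sorted(set(text))))  # [0, 2]
```

In this sketch the number of candidate strings explored is bounded by the number of strings within Hamming distance $k$ of $P$, which grows on the order of $|A|^k m^k$. That is the same combinatorial term that appears in the query-time bounds quoted in the abstract.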