Computing Minimal Absent Words and Extended Bispecial Factors with CDAWG Space
Published in: | arXiv.org 2024-05 |
---|---|
Main authors: | , , , , |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Full text |
Abstract: | A string \(w\) is said to be a minimal absent word (MAW) for a string \(S\) if \(w\) does not occur in \(S\) and every proper substring of \(w\) occurs in \(S\). We focus on non-trivial MAWs, which are of length at least 2. Finding such non-trivial MAWs for a given string is motivated by applications in bioinformatics and data compression. Fujishige et al. [TCS 2023] proposed a data structure of size \(\Theta(n)\) that can output the set \(\mathsf{MAW}(S)\) of all MAWs for a given string \(S\) of length \(n\) in \(O(n + |\mathsf{MAW}(S)|)\) time, based on the directed acyclic word graph (DAWG). In this paper, we present a more space-efficient data structure based on the compact DAWG (CDAWG), which can output \(\mathsf{MAW}(S)\) in \(O(|\mathsf{MAW}(S)|)\) time with \(O(\mathsf{e}_{\min})\) space, where \(\mathsf{e}_{\min}\) denotes the minimum of the sizes of the CDAWGs for \(S\) and for its reversal \(S^R\). For any string of length \(n\), it holds that \(\mathsf{e}_{\min} < 2n\), and for highly repetitive strings \(\mathsf{e}_{\min}\) can be sublinear (up to logarithmic) in \(n\). We also show that MAWs and their generalization, minimal rare words, have close relationships with extended bispecial factors, via the CDAWG. |
ISSN: | 2331-8422 |
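
To make the definition of a non-trivial MAW in the abstract concrete, the following is a minimal brute-force sketch in Python. It is not the CDAWG-based data structure from the paper and does not achieve its time or space bounds; it only enumerates MAWs directly from the definition. The function names `substrings` and `minimal_absent_words` are illustrative assumptions, not taken from the paper. The sketch relies on the observation that a word \(w\) of length at least 2 has all proper substrings occurring in \(S\) exactly when both \(w\) without its last character and \(w\) without its first character occur in \(S\).

```python
def substrings(s):
    """Return the set of all substrings of s, including the empty string."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def minimal_absent_words(s, alphabet=None):
    """Enumerate the non-trivial MAWs (length >= 2) of s by brute force.

    Hypothetical helper for illustration only: it stores every substring of s,
    so it is practical only for short strings.
    """
    if alphabet is None:
        alphabet = sorted(set(s))
    occurring = substrings(s)
    maws = set()
    for left in occurring:
        if not left:
            continue  # candidates must have length at least 2
        for c in alphabet:
            w = left + c
            # w[:-1] == left occurs by construction; require that w[1:] also
            # occurs in s and that w itself is absent from s.
            if w not in occurring and w[1:] in occurring:
                maws.add(w)
    return sorted(maws)


if __name__ == "__main__":
    print(minimal_absent_words("abaab"))  # ['aaa', 'aaba', 'bab', 'bb']
```

For \(S = \texttt{abaab}\), for example, the word \(\texttt{bab}\) is a MAW because it does not occur in \(S\), while its longest proper prefix \(\texttt{ba}\) and longest proper suffix \(\texttt{ab}\) both do.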