Computing Minimal Absent Words and Extended Bispecial Factors with CDAWG Space


Bibliographic Details
Published in: arXiv.org 2024-05
Main Authors: Inenaga, Shunsuke; Mieno, Takuya; Arimura, Hiroki; Funakoshi, Mitsuru; Fujishige, Yuta
Format: Article
Language: English
Online Access: Full text
Description
Abstract: A string \(w\) is said to be a minimal absent word (MAW) for a string \(S\) if \(w\) does not occur in \(S\) and every proper substring of \(w\) occurs in \(S\). We focus on non-trivial MAWs, which are of length at least 2. Finding such non-trivial MAWs for a given string is motivated by applications in bioinformatics and data compression. Fujishige et al. [TCS 2023] proposed a data structure of size \(\Theta(n)\) that can output the set \(\mathsf{MAW}(S)\) of all MAWs for a given string \(S\) of length \(n\) in \(O(n + |\mathsf{MAW}(S)|)\) time, based on the directed acyclic word graph (DAWG). In this paper, we present a more space-efficient data structure based on the compact DAWG (CDAWG), which can output \(\mathsf{MAW}(S)\) in \(O(|\mathsf{MAW}(S)|)\) time with \(O(\mathsf{e}_{\min})\) space, where \(\mathsf{e}_{\min}\) denotes the minimum of the sizes of the CDAWGs for \(S\) and for its reversal \(S^R\). For any string of length \(n\), it holds that \(\mathsf{e}_{\min} < 2n\), and for highly repetitive strings \(\mathsf{e}_{\min}\) can be sublinear (up to logarithmic) in \(n\). We also show that MAWs and their generalization, minimal rare words, have close relationships with extended bispecial factors, via the CDAWG.
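The definition of a non-trivial MAW in the abstract can be illustrated with a small brute-force sketch. This is not the CDAWG-based data structure of the paper and has none of its time or space guarantees; it simply checks the definition directly. It relies on the observation that a word \(aub\) (with characters \(a\), \(b\) and a possibly empty string \(u\)) is a MAW exactly when \(au\) and \(ub\) occur in \(S\) but \(aub\) does not. The function name `minimal_absent_words` and the restriction of the alphabet to the characters of \(S\) are assumptions made for illustration.

```python
def minimal_absent_words(s):
    """Naively enumerate the non-trivial MAWs (length >= 2) of s.

    Illustrative only: the set of all substrings is materialized,
    so this is suitable for short strings, not for real inputs.
    """
    # Assumption: the alphabet is the set of characters occurring in s,
    # which excludes the trivial length-1 MAWs (absent characters).
    alphabet = sorted(set(s))
    n = len(s)
    # Every substring of s, including the empty string.
    factors = {s[i:j] for i in range(n + 1) for j in range(i, n + 1)}

    maws = set()
    for u in factors:
        for a in alphabet:
            if a + u not in factors:
                continue  # the left maximal proper substring must occur
            for b in alphabet:
                # The right maximal proper substring must occur,
                # while the word a + u + b itself must be absent.
                if u + b in factors and a + u + b not in factors:
                    maws.add(a + u + b)
    return sorted(maws)


if __name__ == "__main__":
    print(minimal_absent_words("abaab"))  # ['aaa', 'aaba', 'bab', 'bb']
```

For \(S = \texttt{abaab}\), for example, \(\texttt{aaba}\) is a MAW because \(\texttt{aab}\) and \(\texttt{aba}\) both occur in \(S\) while \(\texttt{aaba}\) does not.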
ISSN:2331-8422