Suffix rank: a new scalable algorithm for indexing large string collections

We investigate the problem of building a suffix array substring index for inputs significantly larger than main memory. This problem is especially important in the context of biological sequence analysis, where biological polymers can be thought of as very large contiguous strings. The objective is...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Proceedings of the VLDB Endowment 2020-08, Vol.13 (12), p.2787-2800
Hauptverfasser: Barsky, Marina, Gabor, Jonathan, Consens, Mariano P., Thomo, Alex
Format: Artikel
Sprache:eng
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:We investigate the problem of building a suffix array substring index for inputs significantly larger than main memory. This problem is especially important in the context of biological sequence analysis, where biological polymers can be thought of as very large contiguous strings. The objective is to index every substring of these long strings to facilitate efficient queries. We propose a new simple, scalable, and inherently parallelizable algorithm for building a suffix array for out-of-core strings. Our new algorithm, Suffix Rank , scales to arbitrarily large inputs, using disk as a memory extension. It solves the problem in just O (log n ) scans over the disk-resident data. We evaluate the practical performance of our new algorithm, and show that for inputs significantly larger than the available amount of RAM, it scales better than other state-of-the-art solutions, such as eSAIS, SAscan , and eGSA.
ISSN:2150-8097
2150-8097
DOI:10.14778/3407790.3407861