Suffix rank: a new scalable algorithm for indexing large string collections
Published in: | Proceedings of the VLDB Endowment 2020-08, Vol.13 (12), p.2787-2800 |
---|---|
Main authors: | , , , |
Format: | Article |
Language: | eng |
Online access: | Full text |
Summary: | We investigate the problem of building a suffix array substring index for inputs significantly larger than main memory. This problem is especially important in the context of biological sequence analysis, where biological polymers can be thought of as very large contiguous strings. The objective is to index every substring of these long strings to facilitate efficient queries. We propose a new simple, scalable, and inherently parallelizable algorithm for building a suffix array for out-of-core strings. Our new algorithm, Suffix Rank, scales to arbitrarily large inputs, using disk as a memory extension. It solves the problem in just O(log n) scans over the disk-resident data. We evaluate the practical performance of our new algorithm, and show that for inputs significantly larger than the available amount of RAM, it scales better than other state-of-the-art solutions, such as eSAIS, SAscan, and eGSA. |
---|---|
ISSN: | 2150-8097 |
DOI: | 10.14778/3407790.3407861 |
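
Illustrative note: the summary above does not spell out how Suffix Rank works, so the following is not the paper's algorithm. It is a minimal in-memory sketch of classic prefix doubling, a well-known rank-based technique, included only to show why a logarithmic number of passes can be enough to order all suffixes: each round doubles the number of leading characters that the ranks account for. The function name `suffix_array_prefix_doubling` is hypothetical, and the sketch assumes the whole string fits in RAM, which is exactly the assumption the paper drops.

```python
def suffix_array_prefix_doubling(text: str) -> list[int]:
    """Return the suffix array of `text` (start positions in sorted suffix order).

    In-memory prefix doubling: after the round with step k, `rank` orders
    suffixes by their first 2*k characters, so about log2(n) rounds suffice.
    """
    n = len(text)
    if n == 0:
        return []

    # Initial ranks: order suffixes by their first character only.
    rank = [ord(c) for c in text]
    sa = list(range(n))
    k = 1  # number of leading characters already reflected in `rank`

    while True:
        # Sort key: (rank of the first k chars, rank of the next k chars).
        # A suffix shorter than k + 1 characters gets -1 and sorts first.
        def key(i: int) -> tuple[int, int]:
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)

        # Re-rank: suffixes with equal keys share a rank for now.
        new_rank = [0] * n
        for prev, cur in zip(sa, sa[1:]):
            new_rank[cur] = new_rank[prev] + (key(prev) != key(cur))
        rank = new_rank

        # Stop once every suffix has a distinct rank.
        if rank[sa[-1]] == n - 1:
            return sa
        k *= 2


if __name__ == "__main__":
    print(suffix_array_prefix_doubling("banana"))
```

Each round here is one sort over all suffix positions. An out-of-core method would have to replace the in-memory sort and the random accesses to `rank[i + k]` with sequential passes over disk-resident data, which this sketch does not attempt.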