Constructing suffix tree for gigabyte sequences with megabyte memory

Mammalian genomes are typically 3 Gbps (gibabase pairs) in size. The largest public database NCBI (National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov)) of DNA contains more than 20 Gbps. Suffix trees are widely acknowledged as a data structure to support exact/approximate seq...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:IEEE transactions on knowledge and data engineering 2005-01, Vol.17 (1), p.90-105
Hauptverfasser: Cheung, Ching-Fung, Yu, Jeffrey Xu, Lu, Hongjun
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Mammalian genomes are typically 3 Gbps (gibabase pairs) in size. The largest public database NCBI (National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov)) of DNA contains more than 20 Gbps. Suffix trees are widely acknowledged as a data structure to support exact/approximate sequence matching queries as well as repetitive structure finding efficiently when they can reside in main memory. But, it has been shown as difficult to handle long DNA sequences using suffix trees due to the so-called memory bottleneck problems. The most space efficient main-memory suffix tree construction algorithm takes nine hours and 45 GB memory space to index the human genome [S. Kurtz (1999)]. We show that suffix trees for long DNA sequences can be efficiently constructed on disk using small bounded main memory space and, therefore, all existing algorithms based on suffix trees can be used to handle long DNA sequences that cannot be held in main memory. We adopt a two-phase strategy to construct a suffix tree on disk: 1) to construct a diskbase suffix-tree without suffix links and 2) rebuild suffix links upon the suffix-tree being constructed on disk, if needed. We propose a new disk-based suffix tree construction algorithm, called DynaCluster, which shows O(nlogn) experimental behavior regarding CPU cost and linearity for I/O cost. DynaCluster needs 16 MB main memory only to construct more than 200 Mbps DNA sequences and significantly outperforms the existing disk-based suffix-tree construction algorithms using prepartitioning techniques in terms of both construction cost and query processing cost. We conducted extensive performance studies and report our findings in this paper.
ISSN:1041-4347
1558-2191
DOI:10.1109/TKDE.2005.3