Distributed RMI-DBG model: Scalable iterative de Bruijn graph algorithm for short read genome assembly problem

Bibliographic details
Published in: Expert Systems with Applications, 2023-12, Vol. 233, p. 120859, Article 120859
Main authors: Zare Hosseini, Zeinab, Kolahdouz Rahimi, Shekoufeh, Forouzan, Esmaeil, Baraani, Ahmad
Format: Article
Language: English
Description
Abstract: Genome assembly is the computational process of merging short fragments of DNA into larger sequences called contigs. The rapid growth of high-throughput genome sequencing technologies and the resulting large volumes of data have shifted the genome assembly paradigm from shared-memory to distributed-memory systems in recent years. Among existing assembly algorithms, the iterative de Bruijn Graph (DBG) is a leading approach for assembling short reads. By exploiting the advantages of every k between kmin and kmax, this approach generates high-quality assemblies. However, assembly slows down, especially on larger data sets. RMI-DBG is an agile iterative de Bruijn Graph algorithm that combines the computational efficiency of de Bruijn Graph methods with the flexibility of overlap-based algorithms. In this paper, we propose a distributed iterative DBG model based on RMI-DBG, named DRMI-DBG. The proposed model addresses the problem of parallelizing de Bruijn Graph construction and processing on distributed-memory systems at each iteration of the algorithm. DRMI-DBG is a scalable iterative DBG framework that runs on a Hadoop cluster and draws on Spark (a batch processing engine) and Giraph (a distributed large-scale graph processing system). Experiments on a variety of real data sets show that DRMI-DBG runs up to 4.8 times faster than the RMI-DBG algorithm and the IDBA-UD assembler, with comparable or better assembly quality. For further evaluation, the performance of the proposed model is compared with ScalaDBG, a state-of-the-art distributed assembler based on the multiple-k-values strategy.
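The iterative de Bruijn Graph idea mentioned in the abstract can be illustrated with a minimal single-machine sketch. This is not RMI-DBG or DRMI-DBG: it assumes error-free reads, performs no tip or bubble removal, and uses neither Spark nor Giraph. The function names (`build_dbg`, `unitigs`, `iterative_assembly`) are hypothetical and chosen only for this illustration. The sketch builds a graph for each k from kmin to kmax and feeds the contigs found at one k into the graph for the next k alongside the reads.

```python
from collections import defaultdict


def build_dbg(sequences, k):
    """Build a de Bruijn graph: each k-mer adds an edge prefix -> suffix,
    where both nodes are (k-1)-mers."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):  # empty range if seq is shorter than k
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def unitigs(graph):
    """Spell out maximal unbranched paths as contigs.
    Simplification: isolated cycles with no branching node are skipped."""
    indeg = defaultdict(int)
    for dsts in graph.values():
        for dst in dsts:
            indeg[dst] += 1

    def is_internal(node):
        # Exactly one way in and one way out: the path continues through it.
        return indeg[node] == 1 and len(graph.get(node, ())) == 1

    contigs = []
    for node in list(graph):
        if is_internal(node):
            continue  # paths start only at branching or source nodes
        for nxt in graph[node]:
            path = node + nxt[-1]
            while is_internal(nxt):
                nxt = next(iter(graph[nxt]))
                path += nxt[-1]
            contigs.append(path)
    return sorted(contigs)


def iterative_assembly(reads, k_min, k_max):
    """Assumed toy iteration: contigs from step k join the reads at step k+1."""
    contigs = []
    for k in range(k_min, k_max + 1):
        graph = build_dbg(list(reads) + contigs, k)
        contigs = unitigs(graph)
    return contigs


if __name__ == "__main__":
    reads = ["ATGGCGT", "GCGTGCA"]  # toy, error-free reads (illustrative only)
    print(iterative_assembly(reads, 3, 4))
```

On these toy reads the graph at k = 3 branches at repeated 2-mers and yields several short fragments, while k = 4 resolves the repeats into one contig. Smaller k values tolerate low coverage and larger k values resolve repeats, which is the usual motivation for iterating over k.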
ISSN:0957-4174
1873-6793
DOI:10.1016/j.eswa.2023.120859