Taming DNA clustering in massive datasets with SLYMFAST

Bibliographic details
Published in:	Applied computing review : a publication of the Special Interest Group on Applied Computing 2022-03, Vol.22 (1), p.15-23
Main authors:	Belcaid, Mahdi; Arisdakessian, Cedric; Kravchenko, Yuliia
Format:	Article
Language:	English
Online access:	Full text

Description
Abstract:	Data from sequencing instruments are produced at such rates that their analysis is becoming increasingly computationally challenging. Although DNA sequence clustering of very large datasets is an important computational step in various bioinformatics applications, it is a performance-intensive task that often cannot be completed without compromising sensitivity for speed. In order to optimize CPU and RAM usage in DNA clustering, we propose a probabilistic, rigorous, and efficient technique to partition a large DNA sequence dataset into smaller, non-overlapping subsets, which can then be analyzed using more precise clustering algorithms. The approach results in a significant reduction in CPU and RAM requirements, as well as a more intuitive parallelization of the clustering step. We show in our results that our algorithm, implemented in a program called SLYMFAST, can cluster in just a few hours datasets that would otherwise take weeks to cluster without partitioning first.
ISSN:	1559-6915, 1931-0161
DOI:	10.1145/3530043.3530045
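
Illustration:	The abstract does not say how SLYMFAST builds its partitions, so the sketch below is not the authors' method. It only illustrates the general partition-then-cluster idea: assign every sequence to exactly one bucket using a cheap key, then hand each bucket to a more precise clustering tool independently. The key used here (the lexicographically smallest k-mer, a deterministic "minimizer") and the names `minimizer` and `partition` are assumptions chosen for illustration; a probabilistic scheme such as the one the abstract mentions would replace this key.

```python
from collections import defaultdict


def minimizer(seq: str, k: int = 4) -> str:
    """Return the lexicographically smallest k-mer of seq.

    Hypothetical bucket key for illustration only; sequences shorter
    than k are used whole.
    """
    if len(seq) < k:
        return seq
    return min(seq[i:i + k] for i in range(len(seq) - k + 1))


def partition(seqs: list[str], k: int = 4) -> dict[str, list[int]]:
    """Split sequence indices into non-overlapping buckets by minimizer.

    Each index lands in exactly one bucket, so buckets can be clustered
    independently (for example, one worker process per bucket).
    """
    buckets: dict[str, list[int]] = defaultdict(list)
    for idx, seq in enumerate(seqs):
        buckets[minimizer(seq.upper(), k)].append(idx)
    return dict(buckets)


# Toy input: two pairs of near-identical reads.
reads = ["AAACGTTT", "AAACGTTG", "CCTTGGTT", "CCTTGGTG"]
buckets = partition(reads, k=4)
for key, members in sorted(buckets.items()):
    print(key, [reads[i] for i in members])
```

With such a simple key, similar sequences can still fall into different buckets whenever a mutation changes their smallest k-mer, which is one reason a real partitioner needs a more careful scheme than this one.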