SeQual: Big Data Tool to Perform Quality Control and Data Preprocessing of Large NGS Datasets

Bibliographic details
Published in:	IEEE Access 2020-01, Vol.8, p.1-1
Main authors:	Exposito, Roberto R.; Galego-Torreiro, Roi; Gonzalez-Dominguez, Jorge
Format:	Article
Language:	English
Subjects:	Acceleration; Apache Spark; Big Data; Bioinformatics; Clusters; Datasets; Deoxyribonucleic acid; Distributed memory; DNA; Massive data points; Next-Generation Sequencing (NGS); Quality control; Sequential analysis; Sparks
Online access:	Full text

Description
Abstract:	This paper presents SeQual, a scalable tool to efficiently perform quality control of large genomic datasets. Our tool currently supports more than 30 different operations (e.g., filtering, trimming, formatting) that can be applied to DNA/RNA reads in FASTQ/FASTA formats to improve subsequent downstream analyses, while providing a simple and user-friendly graphical interface for non-expert users. Furthermore, SeQual takes full advantage of Big Data technologies to process massive datasets on distributed-memory systems such as clusters by relying on the open-source Apache Spark cluster computing framework. Our scalable Spark-based implementation reduces the runtime from more than three hours to less than 20 minutes when processing a paired-end dataset with 251 million reads per input file on an 8-node multi-core cluster.
ISSN:	2169-3536
DOI:	10.1109/ACCESS.2020.3015016
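
Illustrative note: the abstract names filtering and trimming of FASTQ reads as typical quality-control operations. The sketch below shows, in plain single-machine Python, what such operations might look like. It is not SeQual's code or API: the function names, the thresholds (`min_q`, `min_mean_q`, `min_len`) and the Phred+33 quality encoding are all assumptions made for illustration, and the input is assumed to be well-formed four-line FASTQ records.

```python
from statistics import mean


def parse_fastq(lines):
    """Yield (header, sequence, quality) from well-formed 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        seq = next(it)
        next(it)  # '+' separator line, not needed here
        qual = next(it)
        yield header.rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")


def phred(qual, offset=33):
    """Decode a quality string into integer scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in qual]


def trim_tail(seq, qual, min_q=20):
    """Trimming: drop trailing bases whose quality is below min_q."""
    scores = phred(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def passes_filter(seq, qual, min_mean_q=25, min_len=4):
    """Filtering: keep reads that are long enough and good enough on average."""
    return len(seq) >= min_len and mean(phred(qual)) >= min_mean_q


def quality_control(lines):
    """Trim every read, then keep only those that pass the filter."""
    kept = []
    for header, seq, qual in parse_fastq(lines):
        seq, qual = trim_tail(seq, qual)
        if passes_filter(seq, qual):
            kept.append((header, seq, qual))
    return kept


if __name__ == "__main__":
    # Hypothetical sample: 'I' encodes quality 40, '#' encodes quality 2.
    sample = [
        "@read1", "ACGTACGT", "+", "IIIIII##",
        "@read2", "ACGTACGT", "+", "########",
    ]
    for record in quality_control(sample):
        print(record)
```

Each read is handled independently of the others, which is the property that lets a framework such as Apache Spark distribute this kind of per-read work across the nodes of a cluster.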