SparkRA: Enabling Big Data Scalability for the GATK RNA-seq Pipeline with Apache Spark

Bibliographic details
Published in:	Genes 2020-01, Vol.11 (1), p.53, Article 53
Main authors:	Al-Ars, Zaid; Wang, Saiyi; Mushtaq, Hamid
Format:	Article
Language:	English
Subjects:	Best practice; Big Data; Computer applications; Genetics & Heredity; Genomes; Genotypes; Life Sciences & Biomedicine; Optimization techniques; Phenotypes; Ribonucleic acid; RNA; Science & Technology

Description
Abstract:	The rapid proliferation of low-cost RNA-seq data has led to growing interest in RNA analysis techniques for applications ranging from identifying genotype-phenotype relationships to validating the results of other analyses. However, many practical applications in this field are limited by the available computational resources and the long computing times the analysis requires. GATK has a popular best practices pipeline specifically designed for variant calling in RNA-seq analysis. Some tools in this pipeline are not optimized to scale efficiently across multiple processors or compute nodes, which limits their ability to process large datasets. In this paper, we present SparkRA, an Apache Spark-based pipeline that efficiently scales the GATK RNA-seq variant calling pipeline across multiple cores in one node or across a large cluster. On a single node with 20 hyper-threaded cores, the original pipeline takes more than 5 h to process a 32 GB dataset. SparkRA reduces the overall computation time on the same node by about 4x, to 1.3 h. On a cluster with 16 nodes (each with eight single-threaded cores), SparkRA reduces this computation time by a further 7.7x compared to a single node. Compared to other scalable state-of-the-art solutions, SparkRA is 1.2x faster while achieving the same accuracy.
ISSN:	2073-4425
DOI:	10.3390/genes11010053
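
The speedup figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is only an illustration, not taken from the paper: the variable names are made up, the 5 h baseline is a lower bound (the abstract says "more than 5 h"), and it assumes that the 7.7x cluster speedup is measured against the 1.3 h single-node SparkRA run, which the abstract does not state unambiguously.

```python
# Figures quoted in the abstract. All variable names are illustrative.
baseline_hours = 5.0        # original pipeline, single node (lower bound: "more than 5 h")
sparkra_single_hours = 1.3  # SparkRA on the same single node
cluster_speedup = 7.7       # SparkRA on 16 nodes vs. a single node


def speedup(before_hours: float, after_hours: float) -> float:
    """Return how many times faster the 'after' run is than the 'before' run."""
    return before_hours / after_hours


# Single-node speedup of SparkRA over the original pipeline.
single_node_speedup = speedup(baseline_hours, sparkra_single_hours)

# ASSUMPTION: the 7.7x figure is relative to the 1.3 h single-node run.
cluster_hours = sparkra_single_hours / cluster_speedup
cluster_minutes = cluster_hours * 60

# Combined speedup of the cluster run over the original pipeline,
# valid only under the assumption above.
overall_speedup = speedup(baseline_hours, cluster_hours)

print(f"single-node speedup: {single_node_speedup:.2f}x")
print(f"cluster runtime:     {cluster_minutes:.1f} min")
print(f"overall speedup:     {overall_speedup:.1f}x")
```

Because the baseline is a lower bound, the computed single-node speedup is also a lower bound, which is consistent with the abstract's "about 4x".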