ParStream-seq: An improved method of handling next generation sequence data


Bibliographic details
Published in:	Genomics (San Diego, Calif.), 2019-12, Vol.111 (6), p.1641-1650
Main authors:	Mondal, Sudip; Maji, Ranjan Kumar; Ghosh, Zhumur; Khatua, Sunirmal
Format:	Article
Language:	English
Subjects:	Algorithms; Alignment; Biological big data; HDFS; High-Throughput Nucleotide Sequencing; NGS; Parallel computing; Sequence Alignment; Sequence Analysis, DNA; Software; Streaming
Online access:	Full text

Description
Abstract:	The exponential growth of next generation sequencing (NGS) data poses challenges for both its storage and its fast, efficient analysis. Storing the complete data for an experiment and aligning it to the reference genome is an essential step in any quantitative analysis of NGS data. Here, we introduce a streaming access technique, ‘ParStream-seq’, that splits bulk sequence data accessed from a remote repository into short, manageable packets and then aligns those packets in parallel across the compute cores. The optimal packet size, expressed as a fixed number of reads, is determined within the stream so as to maximize system utilization. The results show a reduction in execution time and an improvement in the memory footprint. Overall, this streaming access technique offers a way to avoid storing the entire volume of sequence data for an experiment before analysing it.
• Introduces a sequence streaming protocol, ParStream-seq, which splits and accesses the bulk sequence data.
• Overcomes the hurdle of storing the entire volume of NGS data corresponding to a particular experiment.
• Determines the ‘optimal packet size’ with a fixed number of reads and the optimal number of sequence splits for streaming.
• Facilitates parallel execution of the alignment process and stores the results in the Hadoop Distributed File System (HDFS).
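
The packet-splitting idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (fastq_packets, align_packet, stream_and_align), the FASTQ input format, the thread pool and the placeholder "alignment" are all assumptions made for illustration, and the input is assumed to be well-formed four-line FASTQ records.

```python
import io
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def fastq_packets(stream, reads_per_packet):
    """Yield packets holding a fixed number of FASTQ reads (4 lines per read).

    The stream is consumed lazily, so the full file never has to be held
    in memory or written to local storage by this function.
    """
    while True:
        lines = list(islice(stream, 4 * reads_per_packet))
        if not lines:
            return
        # Regroup the flat list of lines into (header, seq, plus, qual) records.
        yield [
            tuple(line.rstrip("\n") for line in lines[i:i + 4])
            for i in range(0, len(lines), 4)
        ]


def align_packet(packet):
    """Hypothetical stand-in for an aligner call on one packet.

    A real pipeline would hand the packet to an external aligner; here we
    only return (read_id, read_length) so the sketch stays runnable.
    """
    return [(header[1:], len(seq)) for header, seq, _plus, _qual in packet]


def stream_and_align(stream, reads_per_packet=2, workers=4):
    """Split the stream into packets and process the packets concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Note: Executor.map submits every packet up front; a production
        # version would bound the number of in-flight packets.
        per_packet = pool.map(align_packet, fastq_packets(stream, reads_per_packet))
        # map() preserves packet order, so reads come back in input order.
        return [hit for hits in per_packet for hit in hits]


if __name__ == "__main__":
    demo = "@r1\nACGT\n+\nIIII\n@r2\nAC\n+\nII\n@r3\nACGTA\n+\nIIIII\n"
    print(stream_and_align(io.StringIO(demo), reads_per_packet=2, workers=2))
```

In this sketch reads_per_packet plays the role of the packet size that the paper tunes for system utilization; the value used here is arbitrary. Threads are used only to keep the example simple, and a process pool or a cluster scheduler could take their place.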
ISSN:	0888-7543, 1089-8646
DOI:	10.1016/j.ygeno.2018.11.014