ParStream-seq: An improved method of handling next generation sequence data


Bibliographic details
Published in: Genomics (San Diego, Calif.), 2019-12, Vol.111 (6), p.1641-1650
Main authors: Mondal, Sudip; Maji, Ranjan Kumar; Ghosh, Zhumur; Khatua, Sunirmal
Format: Article
Language: English
Description
Summary: The exponential growth of next generation sequencing (NGS) data poses a challenge for both its storage and its efficient analysis. Storing all the data for a particular experiment and aligning it to the reference genome is an essential step in any quantitative analysis of NGS data. Here, we introduce a streaming access technique, 'ParStream-seq', that splits bulk sequence data accessed from a remote repository into short, manageable packets and then aligns the packets in parallel across the compute cores. The optimal packet size, expressed as a fixed number of reads, is determined in the stream so as to maximize system utilization. The results show a reduction in execution time and an improved memory footprint. Overall, this streaming access technique provides a means to overcome the hurdle of storing the entire volume of sequence data for a particular experiment prior to its analysis.
• Introduces a sequence streaming protocol, ParStream-seq, which splits and accesses the bulk sequence data.
• Overcomes the hurdle of storing the entire volume of NGS data corresponding to a particular experiment.
• Determines the 'optimal packet size' with a fixed number of reads and the optimal number of sequence splits for streaming.
• Facilitates parallel execution of the alignment process and stores the results in the Hadoop Distributed File System (HDFS).
ISSN: 0888-7543, 1089-8646
DOI:10.1016/j.ygeno.2018.11.014
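
The summary describes splitting a read stream into packets with a fixed number of reads and processing the packets in parallel. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes FASTQ input with four lines per read. The names `packetize`, `align_packet`, and `stream_align` are hypothetical, and `align_packet` is a placeholder that only counts reads where a real aligner would be invoked. Threads stand in for per-core workers, and writing results to HDFS is omitted.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Iterable, Iterator, List

LINES_PER_READ = 4  # assumption: FASTQ input, one read per four lines


def packetize(lines: Iterable[str], reads_per_packet: int) -> Iterator[List[str]]:
    """Yield successive packets holding at most `reads_per_packet` reads."""
    if reads_per_packet < 1:
        raise ValueError("reads_per_packet must be positive")
    it = iter(lines)
    while True:
        packet = list(islice(it, reads_per_packet * LINES_PER_READ))
        if not packet:
            return
        yield packet


def align_packet(packet: List[str]) -> int:
    """Hypothetical placeholder for an aligner call; returns the read count."""
    return len(packet) // LINES_PER_READ


def stream_align(lines: Iterable[str], reads_per_packet: int, workers: int) -> List[int]:
    """Process packets in parallel, keeping at most `workers` packets in flight."""
    results: List[int] = []
    pending = deque()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for packet in packetize(lines, reads_per_packet):
            pending.append(pool.submit(align_packet, packet))
            if len(pending) >= workers:
                # Wait for the oldest packet before reading more of the stream.
                results.append(pending.popleft().result())
        while pending:
            results.append(pending.popleft().result())
    return results


if __name__ == "__main__":
    # Five synthetic reads, split into packets of two reads each.
    demo = []
    for i in range(5):
        demo += [f"@read{i}", "ACGT", "+", "IIII"]
    print(stream_align(demo, reads_per_packet=2, workers=2))
```

Bounding the number of in-flight packets means the whole stream never has to be held in memory at once, which is the property the summary emphasizes. How the optimal packet size is chosen is not specified in this record, so `reads_per_packet` is left as a plain parameter.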