Large-scale gene sequence lossless parallel compression method based on Spark

Bibliographic details
Main authors:	FANG HOUZHI, JI YIMU, YAO HAICHANG, HU GUANGYONG, PENG JIANHUA, ZHANG YIFAN
Format:	Patent
Language:	Chinese; English
Subjects:	INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS; PHYSICS

Description
Abstract:	The invention discloses a Spark-based method for lossless parallel compression of large-scale gene sequences. The reference sequence and the sequences to be compressed are first preprocessed: the master node extracts the basic base sequence of the reference sequence and builds a matching index, then sends both to all worker nodes as compressed broadcast variables. Each worker node extracts, in parallel, the basic base sequence of a sequence to be compressed and creates an RDD, while the auxiliary sequence information is encoded and stored separately. The compressed file is then obtained through a first parallel matching step, the construction of a second matching index, and a second parallel matching step. The method combines secondary iterative compression of large-scale gene data with Spark's in-memory distributed datasets, and compared with oth
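
The idea of matching a sequence against an indexed reference can be illustrated with a minimal single-node sketch. It is not the patented method: it runs without Spark, covers only one matching pass, and ignores auxiliary information. The k-mer length `K`, the token format, and the function names `build_index`, `compress`, and `decompress` are all assumptions made for illustration. In the method described above, the index would presumably be built once on the master node and broadcast, with the matching run in parallel on the worker nodes.

```python
from typing import Dict, List, Tuple, Union

K = 4  # hypothetical k-mer length; the abstract does not specify one

# A token is either a (reference_position, length) match or a single literal base.
Token = Union[Tuple[int, int], str]


def build_index(reference: str, k: int = K) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to the positions where it starts."""
    index: Dict[str, List[int]] = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def compress(target: str, reference: str,
             index: Dict[str, List[int]], k: int = K) -> List[Token]:
    """Greedily replace runs of the target with the longest reference match."""
    tokens: List[Token] = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Try every reference position that shares the next k-mer.
        for pos in index.get(target[i:i + k], []):
            length = k
            while (i + length < len(target)
                   and pos + length < len(reference)
                   and target[i + length] == reference[pos + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            tokens.append((best_pos, best_len))
            i += best_len
        else:
            tokens.append(target[i])  # no match: keep the base as a literal
            i += 1
    return tokens


def decompress(tokens: List[Token], reference: str) -> str:
    """Rebuild the target exactly from the tokens and the reference."""
    parts: List[str] = []
    for token in tokens:
        if isinstance(token, tuple):
            pos, length = token
            parts.append(reference[pos:pos + length])
        else:
            parts.append(token)
    return "".join(parts)


if __name__ == "__main__":
    reference = "ACGTACGTTTGACCA"
    target = "ACGTTTGACGACCA"
    index = build_index(reference)
    tokens = compress(target, reference, index)
    assert decompress(tokens, reference) == target  # lossless round trip
    print(tokens)
```

The round-trip assertion is the property that makes the scheme lossless: every base of the target is covered either by a reference match or by a literal.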