Large-scale gene sequence lossless parallel compression method based on Spark



Bibliographic details
Main authors: FANG HOUZHI, JI YIMU, YAO HAICHANG, HU GUANGYONG, PENG JIANHUA, ZHANG YIFAN
Format: Patent
Language: chi ; eng
Description
Summary: The invention discloses a Spark-based method for lossless parallel compression of large-scale gene sequences, comprising the following steps. A reference sequence and a sequence to be compressed are preprocessed: the main node extracts the basic base sequence of the reference sequence and constructs a matching index for it. The basic base sequence of the reference sequence and its matching index are sent to all worker nodes as compressed broadcast variables. Each worker node extracts, in parallel, the basic base sequence of the sequence to be compressed and creates an RDD, while the auxiliary sequence information is encoded and stored independently. Finally, the compressed file is obtained through a first parallel matching step, construction of a second matching index, and a second parallel matching step. The method fully combines two-pass (secondary iterative) compression of large-scale gene sequences with Spark's memory-based distributed datasets, and compared with oth
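
The first parallel matching step described in the summary can be illustrated with a minimal sketch. The record does not specify the form of the matching index or the output encoding, so everything below is an assumption made for illustration: a k-mer position table stands in for the matching index, a greedy longest-match scan stands in for the matching step, and the names `build_index`, `match_chunk`, `decode`, and `K` are hypothetical. The sketch runs in plain Python without Spark; in a Spark job the reference and index would be shared through a broadcast variable and `match_chunk` would be applied to each partition of an RDD. The second index construction and second matching pass are not shown.

```python
from collections import defaultdict

K = 4  # assumed k-mer length for the matching index (illustrative only)


def build_index(reference, k=K):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return dict(index)


def match_chunk(chunk, reference, index, k=K):
    """Greedily encode a chunk as ('M', ref_pos, length) matches
    and ('L', text) literal runs for bases with no match."""
    tokens, literal, i = [], [], 0
    while i < len(chunk):
        best_pos, best_len = -1, 0
        # Try every reference position that shares the next k-mer.
        for pos in index.get(chunk[i:i + k], []):
            length = 0
            while (i + length < len(chunk)
                   and pos + length < len(reference)
                   and chunk[i + length] == reference[pos + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            if literal:  # flush pending unmatched bases first
                tokens.append(("L", "".join(literal)))
                literal = []
            tokens.append(("M", best_pos, best_len))
            i += best_len
        else:
            literal.append(chunk[i])
            i += 1
    if literal:
        tokens.append(("L", "".join(literal)))
    return tokens


def decode(tokens, reference):
    """Rebuild the original bases from match and literal tokens."""
    parts = []
    for token in tokens:
        if token[0] == "M":
            _, pos, length = token
            parts.append(reference[pos:pos + length])
        else:
            parts.append(token[1])
    return "".join(parts)


# Toy data (made up for the example, not taken from the patent).
reference = "ACGTACGTTTGACCA"
target = "ACGTTTGAGGACGTAC"
index = build_index(reference)  # done once, on the main node

# Split the target into chunks, standing in for RDD partitions.
chunks = [target[i:i + 8] for i in range(0, len(target), 8)]
encoded = [match_chunk(c, reference, index) for c in chunks]
restored = "".join(decode(t, reference) for t in encoded)
```

Because each chunk is encoded only against the shared reference and index, the chunks can be processed independently, which is what makes the step suitable for parallel execution on worker nodes. Decoding every token list and concatenating the results should reproduce the target exactly, which is the lossless property the method claims.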