A new efficient referential genome compression technique for FastQ files

Full description

Bibliographic details
Published in:	Functional & integrative genomics 2023-12, Vol.23 (4), p.333-333, Article 333
Main authors:	Kumar, Sanjeev; Singh, Mukund Pratap; Nayak, Soumya Ranjan; Khan, Asif Uddin; Jain, Anuj Kumar; Singh, Prabhishek; Diwakar, Manoj; Soujanya, Thota
Format:	Article
Language:	English
Subjects:	Algorithms; Animal Genetics and Genomics; Biochemistry; Bioinformatics; Biomedical and Life Sciences; Cell Biology; Compression; Data compression; Data Compression - methods; data quality; Decompression; Gene mapping; Genome; Genomes; genomics; High-Throughput Nucleotide Sequencing - methods; Life Sciences; Medical laboratories; Microbial Genetics and Genomics; Nucleotide sequence; nucleotide sequences; Original Article; Palindromes; Plant Genetics and Genomics; Sequence Analysis, DNA - methods; Software; streams; surgery
Online access:	Full text

Description
Abstract:	Hospitals and medical laboratories create a tremendous amount of genome sequence data every day for use in research, surgery, and illness diagnosis. Compression is therefore essential to keep the storage, monitoring, and distribution of these data manageable. A novel data compression technique is required to reduce the time as well as the cost of storage, transmission, and data processing. General-purpose compression techniques do not perform well on these data because of their special features: a large number of repeats (tandem and palindromic), a small alphabet, high similarity, and specific file formats. In this study, we provide a method for compressing FastQ files that uses a reference genome as a backup without sacrificing data quality. FastQ files are first split into three streams (identifier, sequence, and quality score), each of which receives its own compression technique. A novel quick and lightweight mapping mechanism is also presented to compress the sequence stream effectively. Experiments show that both the compression ratio and the compression/decompression time of NGS data compressed using RBFQC are superior to those achieved by other state-of-the-art genome compression methods. In comparison to GZIP, RBFQC may achieve a compression ratio of 80–140% for fixed-length datasets and 80–125% for variable-length datasets. Compared to domain-specific referential genome compression techniques for FastQ files, RBFQC improves the total compression and decompression speed by 10–25%.
ISSN:	1438-793X 1438-7948
DOI:	10.1007/s10142-023-01259-x
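
The stream split mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the RBFQC implementation: the function names are hypothetical, zlib stands in for the per-stream techniques (which the abstract does not specify), and the reference-based mapping of the sequence stream is left out. The sketch also assumes four-line records whose separator line is a bare "+".

```python
import zlib


def split_fastq(text):
    """Split FASTQ text into identifier, sequence, and quality streams.

    Assumes four lines per record and a bare '+' separator line.
    """
    lines = text.strip().split("\n")
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    ids, seqs, quals = [], [], []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        ids.append(header[1:])   # drop the leading '@'
        seqs.append(seq)
        quals.append(qual)
    return ids, seqs, quals


def compress_streams(ids, seqs, quals):
    """Compress each stream on its own (zlib is only a stand-in codec)."""
    return tuple(
        zlib.compress("\n".join(stream).encode("ascii"))
        for stream in (ids, seqs, quals)
    )


def decompress_streams(blobs):
    """Rebuild the FASTQ text from the three compressed streams."""
    ids, seqs, quals = (
        zlib.decompress(blob).decode("ascii").split("\n") for blob in blobs
    )
    return "".join(
        f"@{i}\n{s}\n+\n{q}\n" for i, s, q in zip(ids, seqs, quals)
    )
```

Keeping the streams separate lets each one be handled by a codec suited to its statistics, for example the four-letter alphabet of the sequence stream versus the wider symbol range of the quality scores.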