VA-Store: A Virtual Approximate Store Approach to Supporting Repetitive Big Data in Genome Sequence Analyses

Detailed description

Bibliographic details
Published in: IEEE Transactions on Knowledge and Data Engineering, 2020-03, Vol. 32 (3), pp. 602-616
Main authors: Liu, Xianying; Zhu, Qiang; Pramanik, Sakti; Brown, C. Titus; Qian, Gang
Format: Article
Language: English
Description
Abstract: In recent years, we have witnessed an increasing demand to process big data in numerous applications. It is observed that there often exist substantial amounts of repetitive data in different portions of a big data repository/dataset for applications such as genome sequence analyses. In this paper, we present a novel method, called the VA-Store, to reduce the large space requirement for repetitive data in prevailing genome sequence analysis tasks using k-mers (i.e., subsequences of length k) with multiple k values. The VA-Store maintains a physical store for one portion of the input dataset (i.e., k₀-mers) and supports multiple virtual stores for other portions of the dataset (i.e., k-mers with k ≠ k₀). Utilizing important relationships among repetitive data, the VA-Store transforms a given query on a virtual store into one or more queries on the physical store for execution. Both precise and approximate transformations are considered. Accuracy estimation models for approximate solutions are derived. Query optimization strategies are suggested to improve query performance. Our experiments using real and synthetic datasets demonstrate that the VA-Store is quite promising in providing effective storage and efficient query processing for solving a kernel database problem on repetitive big data for genome sequence analysis applications.
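The idea of answering a query on a virtual store from the physical store can be illustrated with a small sketch. It is not the paper's algorithm: the abstract does not specify the transformations, so the prefix-sum rule, the function names kmer_counts and virtual_count, and the toy sequence are all assumptions made for illustration. The sketch stores only 4-mer counts and estimates the count of a shorter k-mer by summing the counts of the stored 4-mers that begin with it.

```python
from collections import Counter

def kmer_counts(seqs, k):
    """Count every length-k substring (k-mer) across the input sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def virtual_count(physical, query):
    """Hypothetical transformation (an assumption, not from the paper):
    estimate the count of a shorter k-mer by summing the counts of all
    stored k0-mers that start with it."""
    return sum(c for kmer, c in physical.items() if kmer.startswith(query))

seqs = ["ACGTACGT"]                # toy input, made up for this example
physical = kmer_counts(seqs, 4)    # physical store: 4-mers only (k0 = 4)
exact = kmer_counts(seqs, 2)       # ground truth, kept only for comparison

for q in ("AC", "GT"):
    print(q, "estimated:", virtual_count(physical, q), "exact:", exact[q])
```

In this toy case the estimate for "AC" matches the exact count, while "GT" is undercounted because its last occurrence lies in the final k₀ - 1 characters of the sequence, which no stored 4-mer begins in. That gap is one simple reason a transformation of this kind may be only approximate unless the sequence ends are handled separately.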
ISSN: 1041-4347, 1558-2191
DOI: 10.1109/TKDE.2018.2885952