VA-Store: A Virtual Approximate Store Approach to Supporting Repetitive Big Data in Genome Sequence Analyses

In recent years, we have witnessed an increasing demand to process big data in numerous applications. It is observed that there often exist substantial amounts of repetitive data in different portions of a big data repository/dataset for applications such as genome sequence analyses. In this paper,...

Bibliographic details
Published in: IEEE transactions on knowledge and data engineering 2020-03, Vol.32 (3), p.602-616
Main authors: Liu, Xianying, Zhu, Qiang, Pramanik, Sakti, Brown, C. Titus, Qian, Gang
Format: Article
Language: eng
container_end_page 616
container_issue 3
container_start_page 602
container_title IEEE transactions on knowledge and data engineering
container_volume 32
creator Liu, Xianying
Zhu, Qiang
Pramanik, Sakti
Brown, C. Titus
Qian, Gang
description In recent years, we have witnessed an increasing demand to process big data in numerous applications. It is observed that there often exist substantial amounts of repetitive data in different portions of a big data repository/dataset for applications such as genome sequence analyses. In this paper, we present a novel method, called the VA-Store, to reduce the large space requirement for repetitive data in prevailing genome sequence analysis tasks using k-mers (i.e., subsequences of length k) with multiple k values. The VA-Store maintains a physical store for one portion of the input dataset (i.e., k₀-mers) and supports multiple virtual stores for other portions of the dataset (i.e., k-mers with k ≠ k₀). Utilizing important relationships among repetitive data, the VA-Store transforms a given query on a virtual store into one or more queries on the physical store for execution. Both precise and approximate transformations are considered. Accuracy estimation models for approximate solutions are derived. Query optimization strategies are suggested to improve query performance. Our experiments using real and synthetic datasets demonstrate that the VA-Store is quite promising in providing effective storage and efficient query processing for solving a kernel database problem on repetitive big data for genome sequence analysis applications.
doi_str_mv 10.1109/TKDE.2018.2885952
format Article
addtitle TKDE
IEEE T KNOWL DATA EN
coden ITKEEH
ieee_id 8573155
lds50 peer_reviewed
linktohtml https://ieeexplore.ieee.org/document/8573155
oa free_for_read
orcidid https://orcid.org/0000-0002-5658-5875
https://orcid.org/0000-0001-7094-9236
publisher LOS ALAMITOS: IEEE
rights Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2020
tpages 15
fulltext fulltext_linktorsrc
identifier ISSN: 1041-4347
ispartof IEEE transactions on knowledge and data engineering, 2020-03, Vol.32 (3), p.602-616
issn 1041-4347
1558-2191
language eng
recordid cdi_webofscience_primary_000526526700014
source IEEE Electronic Library (IEL)
subjects algorithms for data and knowledge management
Big Data
Bioinformatics
Bioinformatics (genome or protein) databases
Computer Science
Computer Science, Artificial Intelligence
Computer Science, Information Systems
Data analysis
data storage representations
Datasets
Engineering
Engineering, Electrical & Electronic
Genomes
Genomics
Model accuracy
Optimization
Queries
Query processing
Science & Technology
Search problems
Sequences
Sequential analysis
Technology
title VA-Store: A Virtual Approximate Store Approach to Supporting Repetitive Big Data in Genome Sequence Analyses
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-16T00%3A40%3A58IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=VA-Store:%20A%20Virtual%20Approximate%20Store%20Approach%20to%20Supporting%20Repetitive%20Big%20Data%20in%20Genome%20Sequence%20Analyses&rft.jtitle=IEEE%20transactions%20on%20knowledge%20and%20data%20engineering&rft.au=Liu,%20Xianying&rft.date=2020-03-01&rft.volume=32&rft.issue=3&rft.spage=602&rft.epage=616&rft.pages=602-616&rft.issn=1041-4347&rft.eissn=1558-2191&rft.coden=ITKEEH&rft_id=info:doi/10.1109/TKDE.2018.2885952&rft_dat=%3Cproquest_RIE%3E2352189460%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2352189460&rft_id=info:pmid/&rft_ieee_id=8573155&rfr_iscdi=true