Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index

Sequence-level searches on large collections of RNA sequencing experiments, such as the NCBI Sequence Read Archive (SRA), would enable one to ask many questions about the expression or variation of a given transcript in a population. Existing approaches, such as the sequence Bloom tree, suffer from...

Full description

Bibliographic details
Published in: Cell systems 2018-08, Vol.7 (2), p.201-207.e4
Main authors: Pandey, Prashant, Almodaresi, Fatemeh, Bender, Michael A., Ferdman, Michael, Johnson, Rob, Patro, Rob
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 207.e4
container_issue 2
container_start_page 201
container_title Cell systems
container_volume 7
creator Pandey, Prashant
Almodaresi, Fatemeh
Bender, Michael A.
Ferdman, Michael
Johnson, Rob
Patro, Rob
description Sequence-level searches on large collections of RNA sequencing experiments, such as the NCBI Sequence Read Archive (SRA), would enable one to ask many questions about the expression or variation of a given transcript in a population. Existing approaches, such as the sequence Bloom tree, suffer from fundamental limitations of the Bloom filter, resulting in slow build and query times, less-than-optimal space usage, and potentially large numbers of false-positives. This paper introduces Mantis, a space-efficient system that uses new data structures to index thousands of raw-read experiments and facilitates large-scale sequence searches. In our evaluation, index construction with Mantis is 6× faster and yields a 20% smaller index than the state-of-the-art split sequence Bloom tree (SSBT). For queries, Mantis is 6–108× faster than SSBT and has no false-positives or -negatives. For example, Mantis was able to search for all 200,400 known human transcripts in an index of 2,652 RNA sequencing experiments in 82 min; SSBT took close to 4 days.
• Mantis is a tool to search through large collections of raw sequencing experiments
• Mantis index is 20% smaller than the Split-Sequence Bloom Tree (SSBT) search index
• Mantis index is 6× faster to build and 6–100× faster to query than the SSBT
• Mantis index is exact; query results contain no false-positives or -negatives
Mantis is a system to index and search through large collections of raw sequencing data. The query sequence can be a known or newly assembled gene or any valid nucleotide sequence. Mantis is faster and smaller than existing sequence-search tools and is exact in the sense that it does not report false-positives. To construct the index, Mantis indexes the k-mers (substrings of size k) in the reads of an experiment and then groups k-mers across experiments that exhibit the same patterns of occurrence.
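The index construction described at the end of the description (collect the k-mers of each experiment, then group k-mers that occur in exactly the same set of experiments) can be illustrated with a toy sketch. This is not the Mantis implementation: it uses plain Python dictionaries rather than the counting quotient filter named in the subject terms, and the function names `kmers`, `build_index`, and `query`, as well as the per-experiment hit count returned by `query`, are assumptions made only for illustration.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Return the set of all substrings of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def build_index(experiments, k):
    """Build a toy exact index over named experiments.

    experiments: dict mapping an experiment name to a list of reads.
    Returns (occurrence, classes):
      occurrence maps each k-mer to the set of experiments containing it;
      classes maps each distinct occurrence pattern (a frozenset of
      experiment names) to the k-mers that share that pattern.
    """
    occurrence = defaultdict(set)
    for name, reads in experiments.items():
        for read in reads:
            for kmer in kmers(read, k):
                occurrence[kmer].add(name)

    # Group k-mers that exhibit the same pattern of occurrence.
    classes = defaultdict(set)
    for kmer, names in occurrence.items():
        classes[frozenset(names)].add(kmer)
    return dict(occurrence), dict(classes)


def query(occurrence, sequence, k):
    """Count, per experiment, how many k-mers of the query it contains.

    Hypothetical helper: lookups are exact dictionary lookups, so an
    experiment is only credited with k-mers it really contains.
    """
    hits = defaultdict(int)
    for kmer in kmers(sequence, k):
        for name in occurrence.get(kmer, ()):
            hits[name] += 1
    return dict(hits)


if __name__ == "__main__":
    experiments = {"A": ["ACGTA"], "B": ["CGTAC"]}
    occurrence, classes = build_index(experiments, k=3)
    for pattern, members in classes.items():
        print(sorted(pattern), sorted(members))
    print(query(occurrence, "ACGT", k=3))
```

Because every lookup here is an exact dictionary lookup, the sketch cannot report an experiment for a k-mer that the experiment does not contain, which is the sense of "exact" used in the description. Storing each distinct occurrence pattern once and letting many k-mers share it is the grouping step the description refers to.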
doi_str_mv 10.1016/j.cels.2018.05.021
format Article
fullrecord <record><control><sourceid>proquest_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_10964368</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S2405471218302394</els_id><sourcerecordid>2059040925</sourcerecordid><originalsourceid>FETCH-LOGICAL-c483t-de9d54b2084a47191c7fcf952dcbbcc667095a39a19e2695f18e67f4238651763</originalsourceid><addsrcrecordid>eNp9UcFO3DAQtRAVIMoP9ICinjiQMHZsJ66QEEVQkLbqYduz5XUmrFdZh9peRP--jhZW5dKTx573nt_MI-QThYoClReryuIQKwa0rUBUwOgeOWIcRMkbBvu7mrJDchLjCgAoV_mRHZBDplQtaSuOyNfvxicXvxTXxZ2J6byYr80wnBfGd8Xti7GpmJnwiOXcmgGLOf7eoLf5iibYZfHgO3z5SD70Zoh48noek193tz9v7svZj28PN9ez0vK2TmWHqhN8waDlJttS1Da97ZVgnV0srJWyASVMrQxVyKQSPW1RNj1ndSsFbWR9TK62uk-bxRo7iz4FM-in4NYm_NGjcfp9x7ulfhyfNQUleS3brPB5qzDG5HS0LqFd2tF7tElTIRmtRQadvX4TxjxtTHrtYt71YDyOm6gZCAUcFJugbAu1YYwxYL8zQ0FPKemVnlLSU0oahM4pZdLpv2PsKG-ZZMDlFpCZ-OwwTE6nrXcuTEa70f1P_y_fKqCT</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2059040925</pqid></control><display><type>article</type><title>Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index</title><source>MEDLINE</source><source>EZB-FREE-00999 freely available EZB journals</source><source>Alma/SFX Local Collection</source><creator>Pandey, Prashant ; Almodaresi, Fatemeh ; Bender, Michael A. ; Ferdman, Michael ; Johnson, Rob ; Patro, Rob</creator><creatorcontrib>Pandey, Prashant ; Almodaresi, Fatemeh ; Bender, Michael A. ; Ferdman, Michael ; Johnson, Rob ; Patro, Rob</creatorcontrib><description>Sequence-level searches on large collections of RNA sequencing experiments, such as the NCBI Sequence Read Archive (SRA), would enable one to ask many questions about the expression or variation of a given transcript in a population. 
Existing approaches, such as the sequence Bloom tree, suffer from fundamental limitations of the Bloom filter, resulting in slow build and query times, less-than-optimal space usage, and potentially large numbers of false-positives. This paper introduces Mantis, a space-efficient system that uses new data structures to index thousands of raw-read experiments and facilitates large-scale sequence searches. In our evaluation, index construction with Mantis is 6× faster and yields a 20% smaller index than the state-of-the-art split sequence Bloom tree (SSBT). For queries, Mantis is 6–108× faster than SSBT and has no false-positives or -negatives. For example, Mantis was able to search for all 200,400 known human transcripts in an index of 2,652 RNA sequencing experiments in 82 min; SSBT took close to 4 days. [Display omitted] •Mantis is a tool to search through large collections of raw sequencing experiments•Mantis index is 20% smaller than the Split-Sequence Bloom Tree (SSBT) search index•Mantis index is 6x faster to build and 6–100× faster to query than the SSBT•Mantis index is exact; query results contain no false-positives or -negatives Mantis is a system to index and search through large collections of raw sequencing data. The query sequence can be a known or newly assembled gene or any valid nucleotide sequence. Mantis is faster and smaller than existing sequence-search tools and is exact in the sense that it does not report false-positives. 
To construct the index, Mantis indexes the k-mers (substrings of size k) in the reads of an experiment and then groups k-mers across experiments that exhibit the same patterns of occurrence.</description><identifier>ISSN: 2405-4712</identifier><identifier>EISSN: 2405-4720</identifier><identifier>DOI: 10.1016/j.cels.2018.05.021</identifier><identifier>PMID: 29936185</identifier><language>eng</language><publisher>United States: Elsevier Inc</publisher><subject>Animals ; Bloom filter ; color equivalence classes ; counting quotient filter ; Databases, Genetic ; de Bruijn graph ; experiment discovery ; Humans ; Mantis ; RNA - genetics ; RNA sequencing ; Sequence Analysis, RNA - economics ; Sequence Analysis, RNA - methods ; sequence Bloom tree ; sequence search ; Software ; Time Factors ; Transcriptome</subject><ispartof>Cell systems, 2018-08, Vol.7 (2), p.201-207.e4</ispartof><rights>2018 Elsevier Inc.</rights><rights>Copyright © 2018 Elsevier Inc. All rights reserved.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c483t-de9d54b2084a47191c7fcf952dcbbcc667095a39a19e2695f18e67f4238651763</citedby><cites>FETCH-LOGICAL-c483t-de9d54b2084a47191c7fcf952dcbbcc667095a39a19e2695f18e67f4238651763</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>230,315,781,785,886,27929,27930</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/29936185$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink><backlink>$$Uhttps://www.osti.gov/biblio/1562135$$D View this record in Osti.gov$$Hfree_for_read</backlink></links><search><creatorcontrib>Pandey, Prashant</creatorcontrib><creatorcontrib>Almodaresi, Fatemeh</creatorcontrib><creatorcontrib>Bender, Michael A.</creatorcontrib><creatorcontrib>Ferdman, Michael</creatorcontrib><creatorcontrib>Johnson, 
Rob</creatorcontrib><creatorcontrib>Patro, Rob</creatorcontrib><title>Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index</title><title>Cell systems</title><addtitle>Cell Syst</addtitle><description>Sequence-level searches on large collections of RNA sequencing experiments, such as the NCBI Sequence Read Archive (SRA), would enable one to ask many questions about the expression or variation of a given transcript in a population. Existing approaches, such as the sequence Bloom tree, suffer from fundamental limitations of the Bloom filter, resulting in slow build and query times, less-than-optimal space usage, and potentially large numbers of false-positives. This paper introduces Mantis, a space-efficient system that uses new data structures to index thousands of raw-read experiments and facilitates large-scale sequence searches. In our evaluation, index construction with Mantis is 6× faster and yields a 20% smaller index than the state-of-the-art split sequence Bloom tree (SSBT). For queries, Mantis is 6–108× faster than SSBT and has no false-positives or -negatives. For example, Mantis was able to search for all 200,400 known human transcripts in an index of 2,652 RNA sequencing experiments in 82 min; SSBT took close to 4 days. [Display omitted] •Mantis is a tool to search through large collections of raw sequencing experiments•Mantis index is 20% smaller than the Split-Sequence Bloom Tree (SSBT) search index•Mantis index is 6x faster to build and 6–100× faster to query than the SSBT•Mantis index is exact; query results contain no false-positives or -negatives Mantis is a system to index and search through large collections of raw sequencing data. The query sequence can be a known or newly assembled gene or any valid nucleotide sequence. Mantis is faster and smaller than existing sequence-search tools and is exact in the sense that it does not report false-positives. 
To construct the index, Mantis indexes the k-mers (substrings of size k) in the reads of an experiment and then groups k-mers across experiments that exhibit the same patterns of occurrence.</description><subject>Animals</subject><subject>Bloom filter</subject><subject>color equivalence classes</subject><subject>counting quotient filter</subject><subject>Databases, Genetic</subject><subject>de Bruijn graph</subject><subject>experiment discovery</subject><subject>Humans</subject><subject>Mantis</subject><subject>RNA - genetics</subject><subject>RNA sequencing</subject><subject>Sequence Analysis, RNA - economics</subject><subject>Sequence Analysis, RNA - methods</subject><subject>sequence Bloom tree</subject><subject>sequence search</subject><subject>Software</subject><subject>Time Factors</subject><subject>Transcriptome</subject><issn>2405-4712</issn><issn>2405-4720</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2018</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><recordid>eNp9UcFO3DAQtRAVIMoP9ICinjiQMHZsJ66QEEVQkLbqYduz5XUmrFdZh9peRP--jhZW5dKTx573nt_MI-QThYoClReryuIQKwa0rUBUwOgeOWIcRMkbBvu7mrJDchLjCgAoV_mRHZBDplQtaSuOyNfvxicXvxTXxZ2J6byYr80wnBfGd8Xti7GpmJnwiOXcmgGLOf7eoLf5iibYZfHgO3z5SD70Zoh48noek193tz9v7svZj28PN9ez0vK2TmWHqhN8waDlJttS1Da97ZVgnV0srJWyASVMrQxVyKQSPW1RNj1ndSsFbWR9TK62uk-bxRo7iz4FM-in4NYm_NGjcfp9x7ulfhyfNQUleS3brPB5qzDG5HS0LqFd2tF7tElTIRmtRQadvX4TxjxtTHrtYt71YDyOm6gZCAUcFJugbAu1YYwxYL8zQ0FPKemVnlLSU0oahM4pZdLpv2PsKG-ZZMDlFpCZ-OwwTE6nrXcuTEa70f1P_y_fKqCT</recordid><startdate>20180822</startdate><enddate>20180822</enddate><creator>Pandey, Prashant</creator><creator>Almodaresi, Fatemeh</creator><creator>Bender, Michael A.</creator><creator>Ferdman, Michael</creator><creator>Johnson, Rob</creator><creator>Patro, Rob</creator><general>Elsevier 
Inc</general><general>Elsevier</general><scope>6I.</scope><scope>AAFTH</scope><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7X8</scope><scope>OTOTI</scope><scope>5PM</scope></search><sort><creationdate>20180822</creationdate><title>Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index</title><author>Pandey, Prashant ; Almodaresi, Fatemeh ; Bender, Michael A. ; Ferdman, Michael ; Johnson, Rob ; Patro, Rob</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c483t-de9d54b2084a47191c7fcf952dcbbcc667095a39a19e2695f18e67f4238651763</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2018</creationdate><topic>Animals</topic><topic>Bloom filter</topic><topic>color equivalence classes</topic><topic>counting quotient filter</topic><topic>Databases, Genetic</topic><topic>de Bruijn graph</topic><topic>experiment discovery</topic><topic>Humans</topic><topic>Mantis</topic><topic>RNA - genetics</topic><topic>RNA sequencing</topic><topic>Sequence Analysis, RNA - economics</topic><topic>Sequence Analysis, RNA - methods</topic><topic>sequence Bloom tree</topic><topic>sequence search</topic><topic>Software</topic><topic>Time Factors</topic><topic>Transcriptome</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Pandey, Prashant</creatorcontrib><creatorcontrib>Almodaresi, Fatemeh</creatorcontrib><creatorcontrib>Bender, Michael A.</creatorcontrib><creatorcontrib>Ferdman, Michael</creatorcontrib><creatorcontrib>Johnson, Rob</creatorcontrib><creatorcontrib>Patro, Rob</creatorcontrib><collection>ScienceDirect Open Access Titles</collection><collection>Elsevier:ScienceDirect:Open Access</collection><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE 
(Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>MEDLINE - Academic</collection><collection>OSTI.GOV</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Cell systems</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Pandey, Prashant</au><au>Almodaresi, Fatemeh</au><au>Bender, Michael A.</au><au>Ferdman, Michael</au><au>Johnson, Rob</au><au>Patro, Rob</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index</atitle><jtitle>Cell systems</jtitle><addtitle>Cell Syst</addtitle><date>2018-08-22</date><risdate>2018</risdate><volume>7</volume><issue>2</issue><spage>201</spage><epage>207.e4</epage><pages>201-207.e4</pages><issn>2405-4712</issn><eissn>2405-4720</eissn><abstract>Sequence-level searches on large collections of RNA sequencing experiments, such as the NCBI Sequence Read Archive (SRA), would enable one to ask many questions about the expression or variation of a given transcript in a population. Existing approaches, such as the sequence Bloom tree, suffer from fundamental limitations of the Bloom filter, resulting in slow build and query times, less-than-optimal space usage, and potentially large numbers of false-positives. This paper introduces Mantis, a space-efficient system that uses new data structures to index thousands of raw-read experiments and facilitates large-scale sequence searches. In our evaluation, index construction with Mantis is 6× faster and yields a 20% smaller index than the state-of-the-art split sequence Bloom tree (SSBT). For queries, Mantis is 6–108× faster than SSBT and has no false-positives or -negatives. 
For example, Mantis was able to search for all 200,400 known human transcripts in an index of 2,652 RNA sequencing experiments in 82 min; SSBT took close to 4 days. [Display omitted] •Mantis is a tool to search through large collections of raw sequencing experiments•Mantis index is 20% smaller than the Split-Sequence Bloom Tree (SSBT) search index•Mantis index is 6x faster to build and 6–100× faster to query than the SSBT•Mantis index is exact; query results contain no false-positives or -negatives Mantis is a system to index and search through large collections of raw sequencing data. The query sequence can be a known or newly assembled gene or any valid nucleotide sequence. Mantis is faster and smaller than existing sequence-search tools and is exact in the sense that it does not report false-positives. To construct the index, Mantis indexes the k-mers (substrings of size k) in the reads of an experiment and then groups k-mers across experiments that exhibit the same patterns of occurrence.</abstract><cop>United States</cop><pub>Elsevier Inc</pub><pmid>29936185</pmid><doi>10.1016/j.cels.2018.05.021</doi><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 2405-4712
ispartof Cell systems, 2018-08, Vol.7 (2), p.201-207.e4
issn 2405-4712
2405-4720
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_10964368
source MEDLINE; EZB-FREE-00999 freely available EZB journals; Alma/SFX Local Collection
subjects Animals
Bloom filter
color equivalence classes
counting quotient filter
Databases, Genetic
de Bruijn graph
experiment discovery
Humans
Mantis
RNA - genetics
RNA sequencing
Sequence Analysis, RNA - economics
Sequence Analysis, RNA - methods
sequence Bloom tree
sequence search
Software
Time Factors
Transcriptome
title Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-14T00%3A57%3A57IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Mantis:%20A%20Fast,%20Small,%20and%20Exact%20Large-Scale%20Sequence-Search%20Index&rft.jtitle=Cell%20systems&rft.au=Pandey,%20Prashant&rft.date=2018-08-22&rft.volume=7&rft.issue=2&rft.spage=201&rft.epage=207.e4&rft.pages=201-207.e4&rft.issn=2405-4712&rft.eissn=2405-4720&rft_id=info:doi/10.1016/j.cels.2018.05.021&rft_dat=%3Cproquest_pubme%3E2059040925%3C/proquest_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2059040925&rft_id=info:pmid/29936185&rft_els_id=S2405471218302394&rfr_iscdi=true