Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index

Detailed description

Bibliographic details
Published in:	Cell systems 2018-08, Vol.7 (2), p.201-207.e4
Main authors:	Pandey, Prashant; Almodaresi, Fatemeh; Bender, Michael A.; Ferdman, Michael; Johnson, Rob; Patro, Rob
Format:	Article
Language:	English
Subjects:	Animals; Bloom filter; color equivalence classes; counting quotient filter; Databases, Genetic; de Bruijn graph; experiment discovery; Humans; Mantis; RNA - genetics; RNA sequencing; Sequence Analysis, RNA - economics; Sequence Analysis, RNA - methods; sequence Bloom tree; sequence search; Software; Time Factors; Transcriptome
Online access:	Full text

Description
Summary:	Sequence-level searches on large collections of RNA sequencing experiments, such as the NCBI Sequence Read Archive (SRA), would enable one to ask many questions about the expression or variation of a given transcript in a population. Existing approaches, such as the sequence Bloom tree, suffer from fundamental limitations of the Bloom filter, resulting in slow build and query times, less-than-optimal space usage, and potentially large numbers of false-positives. This paper introduces Mantis, a space-efficient system that uses new data structures to index thousands of raw-read experiments and facilitates large-scale sequence searches. In our evaluation, index construction with Mantis is 6× faster and yields a 20% smaller index than the state-of-the-art split sequence Bloom tree (SSBT). For queries, Mantis is 6–108× faster than SSBT and has no false-positives or -negatives. For example, Mantis was able to search for all 200,400 known human transcripts in an index of 2,652 RNA sequencing experiments in 82 min; SSBT took close to 4 days.

• Mantis is a tool to search through large collections of raw sequencing experiments
• Mantis index is 20% smaller than the Split-Sequence Bloom Tree (SSBT) search index
• Mantis index is 6× faster to build and 6–100× faster to query than the SSBT
• Mantis index is exact; query results contain no false-positives or -negatives

Mantis is a system to index and search through large collections of raw sequencing data. The query sequence can be a known or newly assembled gene or any valid nucleotide sequence. Mantis is faster and smaller than existing sequence-search tools and is exact in the sense that it does not report false-positives. To construct the index, Mantis indexes the k-mers (substrings of size k) in the reads of an experiment and then groups k-mers across experiments that exhibit the same patterns of occurrence.
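
The construction described in the last sentence can be illustrated with a toy sketch. This is not the Mantis implementation: plain Python dictionaries stand in for the counting quotient filter named in the subject terms, and the function names (`kmers`, `build_index`, `query`) and the `theta` threshold parameter are assumptions made only for illustration. The sketch maps every k-mer to the set of experiments that contain it, then gives k-mers with identical sets a shared class id, so each distinct pattern of occurrence is stored once.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(experiments, k):
    """Build a toy index from {experiment_id: [read, ...]}.

    Returns (kmer_to_class, classes): kmer_to_class maps a k-mer to a
    class id, and classes[class_id] is the frozenset of experiments
    shared by every k-mer in that class.
    """
    # Step 1: record which experiments each k-mer occurs in.
    kmer_to_exps = defaultdict(set)
    for exp_id, reads in experiments.items():
        for read in reads:
            for km in kmers(read, k):
                kmer_to_exps[km].add(exp_id)

    # Step 2: group k-mers that share the same pattern of occurrence.
    class_ids = {}       # frozenset of experiments -> class id
    kmer_to_class = {}
    for km, exps in kmer_to_exps.items():
        pattern = frozenset(exps)
        kmer_to_class[km] = class_ids.setdefault(pattern, len(class_ids))

    classes = [None] * len(class_ids)
    for pattern, cid in class_ids.items():
        classes[cid] = pattern
    return kmer_to_class, classes


def query(index, seq, k, theta=1.0):
    """Return experiments containing at least a fraction theta of the
    distinct k-mers of seq (theta is an assumed, hypothetical parameter)."""
    kmer_to_class, classes = index
    query_kmers = set(kmers(seq, k))
    hits = defaultdict(int)
    for km in query_kmers:
        cid = kmer_to_class.get(km)
        if cid is not None:
            for exp_id in classes[cid]:
                hits[exp_id] += 1
    needed = theta * len(query_kmers)
    return sorted(e for e, n in hits.items() if n >= needed)
```

Because the lookup is an exact dictionary match rather than a Bloom filter probe, this toy version, like the system the abstract describes, reports no false positives for individual k-mers. It ignores details a real tool would need, such as reverse complements and compact encoding.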
ISSN:	2405-4712 2405-4720
DOI:	10.1016/j.cels.2018.05.021