Simrank: Rapid and sensitive general-purpose k-mer search tool

Bibliographic details
Published in:	BMC Ecology 2011-04, Vol.11 (1), p.11-11
Main authors:	DeSantis, Todd Z; Keller, Keith; Karaoz, Ulas; Alekseyenko, Alexander V; Singh, Navjeet NS; Brodie, Eoin L; Pei, Zhiheng; Andersen, Gary L; Larsen, Niels
Format:	Article
Language:	English
Subjects:	ALGORITHMS; Analysis; CLASSIFICATION; Computational Biology; COMPUTER CALCULATIONS; computer software; DATA ANALYSIS; Databases, Bibliographic; Databases, Factual; DNA; ecologists; ENVIRONMENTAL SCIENCES; GEOSCIENCES; GRAPH THEORY; INFORMATION RETRIEVAL; Internet/Web search services; Molecular Biology; Nucleotide sequence; PERFORMANCE TESTING; Physiological aspects; PROTEINS; Public software; RNA; Software; Technology application; trees
Online access:	Full text

Description
Summary:	BACKGROUND: Terabyte-scale collections of string-encoded data are expected from consortia efforts such as the Human Microbiome Project (http://nihroadmap.nih.gov/hmp). Rapid k-mer matching strategies enable similarity searches within and between projects. Software applications for sequence database partitioning, guide tree estimation, molecular classification and alignment acceleration have benefited from embedded k-mer searches as subroutines. However, a rapid, general-purpose, open-source, flexible, stand-alone k-mer tool has not been available. RESULTS: Here we present a stand-alone utility, Simrank, which allows users to rapidly identify the database strings most similar to query strings. Performance testing of Simrank and related tools against DNA, RNA, protein and human-language datasets found Simrank to be 10X to 928X faster, depending on the dataset. CONCLUSIONS: Simrank provides molecular ecologists with a high-throughput, open-source choice for comparing large sequence sets to find similarity.
ISSN:	1472-6785
DOI:	10.1186/1472-6785-11-11
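
The summary describes ranking database strings by k-mer similarity to a query. The general idea can be illustrated with a minimal Python sketch. This is not Simrank's code or its exact scoring; the function names (`kmers`, `build_index`, `rank`), the toy sequences, and the score (the fraction of the query's distinct k-mers found in each database entry) are all assumptions chosen for illustration.

```python
from collections import defaultdict


def kmers(s, k):
    """Return the set of distinct length-k substrings of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def build_index(database, k):
    """Map each k-mer to the names of the database entries containing it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def rank(query, index, k, top=3):
    """Rank entries by the fraction of the query's distinct k-mers they share."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return []  # query shorter than k: nothing to match
    hits = defaultdict(int)
    for kmer in query_kmers:
        for name in index.get(kmer, ()):
            hits[name] += 1
    scored = [(name, count / len(query_kmers)) for name, count in hits.items()]
    # Highest score first; ties broken by name for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top]


# Hypothetical toy database; real inputs would be far larger.
database = {
    "seq_a": "ACGTACGTGA",
    "seq_b": "TTTTGGGGCC",
    "seq_c": "ACGTTTTTGA",
}
index = build_index(database, k=4)
results = rank("ACGTACGT", index, k=4)
print(results)  # [('seq_a', 1.0), ('seq_c', 0.25)]
```

Building the inverted index once and reusing it for many queries is what makes this style of search fast: each query touches only the entries that share at least one k-mer with it, rather than being compared against every database string.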