Reducing storage requirements for biological sequence comparison

Bibliographic Details
Published in:	Bioinformatics 2004-12, Vol.20 (18), p.3363-3369
Main authors:	Roberts, Michael; Hayes, Wayne; Hunt, Brian R.; Mount, Stephen M.; Yorke, James A.
Format:	Article
Language:	English
Subjects:	Algorithms; Biological and medical sciences; Databases, Genetic; Fundamental and applied biological sciences. Psychology; General aspects; Information Storage and Retrieval - methods; Mathematics in biology. Statistical analysis. Models. Metrology. Data processing in biology (general aspects); Numerical Analysis, Computer-Assisted; Sequence Alignment - methods; Sequence Analysis - methods
Online access:	Full text
Description
Summary:	Motivation: Comparison of nucleic acid and protein sequences is a fundamental tool of modern bioinformatics. A dominant method of such string matching is the ‘seed-and-extend’ approach, in which occurrences of short subsequences called ‘seeds’ are used to search for potentially longer matches in a large database of sequences. Each such potential match is then checked to see if it extends beyond the seed. To be effective, the seed-and-extend approach needs to catalogue seeds from virtually every substring in the database of search strings. Projects such as mammalian genome assemblies and large-scale protein matching, however, have such large sequence databases that the resulting list of seeds cannot be stored in RAM on a single computer. This significantly slows the matching process. Results: We present a simple and elegant method in which only a small fraction of seeds, called ‘minimizers’, needs to be stored. Using minimizers can speed up string-matching computations by a large factor while missing only a small fraction of the matches found using all seeds.
ISSN:	1367-4803 1460-2059 1367-4811
DOI:	10.1093/bioinformatics/bth408
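
Illustration: the summary above says that only a small fraction of seeds, the ‘minimizers’, needs to be stored, but it does not say how they are chosen. The sketch below assumes the commonly used window-based definition: in every window of w consecutive k-mers, keep only the lexicographically smallest one. The function name `minimizers`, the parameters `k` and `w`, and the use of plain lexicographic ordering are assumptions made for illustration and are not taken from this record.

```python
from typing import Set, Tuple


def minimizers(seq: str, k: int, w: int) -> Set[Tuple[int, str]]:
    """Return (position, k-mer) pairs selected as window minimizers.

    Assumed definition (not stated in the record): for each window of
    w consecutive k-mers, keep the lexicographically smallest k-mer;
    ties are broken in favour of the leftmost position.
    """
    if k <= 0 or w <= 0:
        raise ValueError("k and w must be positive")
    n_kmers = len(seq) - k + 1           # number of k-mers in seq
    chosen: Set[Tuple[int, str]] = set()
    # Slide a window over the k-mer start positions; if the sequence
    # holds fewer than w k-mers, no window exists and nothing is kept.
    for start in range(max(0, n_kmers - w + 1)):
        window = [(seq[i:i + k], i) for i in range(start, start + w)]
        kmer, pos = min(window)          # smallest k-mer, then leftmost
        chosen.add((pos, kmer))
    return chosen


if __name__ == "__main__":
    # Of the six 3-mers in this string, only three are stored.
    print(sorted(minimizers("ACGTACGT", k=3, w=3)))
```

Under this assumed definition, any two sequences that share an exact substring of length w + k - 1 select the same minimizer from that substring, so shared regions of at least that length can still be seeded even though most k-mers are never stored.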