Compression and approximate matching

Bibliographic details
Published in:	The Computer Journal, 1999, Vol. 42 (1), pp. 1-10
Main authors:	Allison, L.; Powell, D.; Dix, T. I.
Format:	Article
Language:	English
Subjects:	Algorithms; Biological and medical sciences; Computers; Costs; Data compression; Fundamental and applied biological sciences. Psychology; Miscellaneous; Molecular and cellular biology; Molecular genetics; Mutation; Population
Online access:	Full text

Description
Abstract:	A population of sequences is called non-random if there is a statistical model and an associated compression algorithm that allows members of the population to be compressed, on average. Any available statistical model of a population should be incorporated into algorithms for alignment of the sequences, and doing so changes the rank order of possible alignments in general. The model should also be used in deciding whether a resulting approximate match between two sequences is significant. It is shown how to do this for two plausible interpretations involving pairs of sequences that might or might not be related. Efficient alignment algorithms are described for quite general statistical models of sequences. The new alignment algorithms are more sensitive to what might be termed 'features' of the sequences. A natural significance test is shown to be rarely fooled by apparent similarities between two sequences that are merely typical of all or most members of the population, even unrelated members.
ISSN:	0010-4620 1460-2067
DOI:	10.1093/comjnl/42.1.1
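
Illustration: the abstract's definition of a non-random population (one whose members a statistical model can compress, on average) can be made concrete with a small sketch. The code below is not the paper's algorithm. It assumes a DNA alphabet, an adaptive order-k Markov model with add-one smoothing, and the hypothetical function names `message_length_bits` and `uniform_length_bits`, all chosen here purely for illustration. It compares the cost of encoding a sequence under that model with the cost under a uniform model (2 bits per base).

```python
import math
from collections import defaultdict


def message_length_bits(seq, order=1, alphabet="ACGT"):
    """Bits to encode seq with an adaptive order-k Markov model.

    Assumption for illustration only: symbol probabilities are estimated
    from the counts seen so far in the same sequence, with add-one
    smoothing, so a decoder could rebuild the same estimates.
    """
    counts = defaultdict(lambda: defaultdict(int))
    total_bits = 0.0
    for i, symbol in enumerate(seq):
        # The context is the previous `order` symbols (shorter at the start).
        context = seq[max(0, i - order):i]
        ctx_counts = counts[context]
        ctx_total = sum(ctx_counts.values())
        p = (ctx_counts[symbol] + 1) / (ctx_total + len(alphabet))
        total_bits += -math.log2(p)
        ctx_counts[symbol] += 1  # update the model after coding the symbol
    return total_bits


def uniform_length_bits(seq, alphabet="ACGT"):
    """Bits to encode seq when every symbol is equally likely."""
    return len(seq) * math.log2(len(alphabet))


if __name__ == "__main__":
    repetitive = "AT" * 50
    print(message_length_bits(repetitive), uniform_length_bits(repetitive))
```

Under these assumptions, a highly repetitive sequence costs far fewer bits under the Markov model than under the uniform model. That is the sense in which such a sequence is compressible, and it is also why a shared repetitive pattern between two sequences is weak evidence that they are related.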