Motif statistics
Published in: | Theoretical Computer Science 2002-09, Vol.287 (2), p.593-617 |
---|---|
Main authors: | , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Abstract: | We present a complete analysis of the statistics of the number of occurrences of a regular expression pattern in a random text. This covers “motifs” widely used in computational biology. Our approach is based on: (i) classical constructive results in automata and formal language theory; (ii) analytic combinatorics, which is used for deriving asymptotic properties from generating functions; (iii) computer algebra, in order to determine generating functions explicitly, analyse generating functions and extract coefficients efficiently. We provide constructions for overlapping or non-overlapping matches of a regular expression. A companion implementation produces: multivariate generating functions for the statistics under study; a fast computation of their Taylor coefficients, which yields exact values of the moments with typical application to random texts of size 30,000; precise asymptotic formulæ that allow predictions in texts of arbitrarily large sizes. Our implementation was tested by comparing predictions of the number of occurrences of motifs against the 7-megabyte amino acid database PRODOM. We handled more than 88% of the standard collection of PROSITE motifs with our programs. Such comparisons help detect which motifs are observed in real biological data more or less frequently than theoretically predicted. |
---|---|
ISSN: | 0304-3975 1879-2294 |
DOI: | 10.1016/S0304-3975(01)00264-X |
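
The abstract describes deriving occurrence statistics from an automaton that recognises the motif. As a rough illustration only, the sketch below replaces the paper's generating-function and computer-algebra machinery with a plain dynamic program, and it handles just a single-word motif with overlapping matches in a text of independent letters. It is not the authors' companion implementation. The function names (`kmp_automaton`, `occurrence_distribution`, `mean_and_variance`) and the two-letter alphabet in the usage lines are assumptions made for this example.

```python
from fractions import Fraction


def kmp_automaton(word, alphabet):
    """Build a DFA for `word`: state k means the longest prefix of `word`
    that is a suffix of the text read so far has length k."""
    m = len(word)
    delta = [{} for _ in range(m + 1)]
    for state in range(m + 1):
        for c in alphabet:
            s = word[:state] + c
            # Longest k such that word[:k] is a suffix of s.
            k = min(m, len(s))
            while k > 0 and s[-k:] != word[:k]:
                k -= 1
            delta[state][c] = k
    return delta


def occurrence_distribution(word, probs, n):
    """Exact distribution of the number of overlapping occurrences of
    `word` in a random text of length n with independent letters.

    `probs` maps each letter to its probability (use Fraction for
    exact arithmetic)."""
    delta = kmp_automaton(word, list(probs))
    m = len(word)
    # dist maps (automaton state, occurrences so far) to a probability.
    dist = {(0, 0): Fraction(1)}
    for _ in range(n):
        nxt = {}
        for (state, count), p in dist.items():
            for c, pc in probs.items():
                t = delta[state][c]
                # Reaching state m means an occurrence ends here.
                key = (t, count + (1 if t == m else 0))
                nxt[key] = nxt.get(key, 0) + p * pc
        dist = nxt
    # Forget the automaton state and keep only the occurrence count.
    out = {}
    for (_, count), p in dist.items():
        out[count] = out.get(count, 0) + p
    return out


def mean_and_variance(dist):
    """First two moments of a distribution given as {value: probability}."""
    mean = sum(k * p for k, p in dist.items())
    second = sum(k * k * p for k, p in dist.items())
    return mean, second - mean * mean


# Usage: the word "aa" in a uniform random text of length 3 over {a, b}.
uniform = {"a": Fraction(1, 2), "b": Fraction(1, 2)}
dist = occurrence_distribution("aa", uniform, 3)
print(dist, mean_and_variance(dist))
```

The table of (state, count) pairs grows with the text length, so this direct approach suits only short texts. The abstract's exact moments for texts of size 30,000 and its asymptotic formulæ rely on the generating-function methods it describes, which this sketch does not attempt.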