Optimal String Mining Under Frequency Constraints


Bibliographic details
Main authors:	Fischer, Johannes; Heun, Volker; Kramer, Stefan
Format:	Book chapter
Language:	English
Subjects:	Applied sciences | Biological and medical sciences | Computer science; control theory; systems | Data processing. List processing. Character string processing | Exact sciences and technology | Frequency Constraint | Fundamental and applied biological sciences. Psychology | General aspects | Index Structure | Information systems. Data bases | Lexicographic Order | Mathematics in biology. Statistical analysis. Models. Metrology. Data processing in biology (general aspects) | Memory organisation. Data processing | Pattern Domain | Secondary Memory | Software
Online access:	Full text

Description
Summary:	We propose a new algorithmic framework that solves frequency-related data mining queries on databases of strings in optimal time, i.e., in time linear in the input and the output size. The additional space is linear in the input size. Our framework can be used to mine frequent strings, emerging strings, and strings that pass other statistical tests, e.g., the χ²-test. In contrast to this result for strings, no optimal algorithms are known for other pattern domains such as itemsets. Our approach builds on several recent results on index structures for strings, among them suffix and LCP arrays, and on a new preprocessing scheme for range minimum queries. The advantages of array-based data structures over dynamic data structures such as trees are good locality behavior and extensibility to secondary memory. We test our algorithm on real-world data from computational biology and demonstrate that the approach also works well in practice.
ISSN:	0302-9743, 1611-3349
DOI:	10.1007/11871637_17