Algebras for Querying Text Regions: Expressive Power and Optimization

There is a significant amount of interest in combining and extending database and information retrieval technologies to manage textual data. The challenge is becoming more relevant due to increased availability of documents in digital form. Document data has a natural hierarchical structure, which m...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Journal of computer and system sciences 1998-12, Vol.57 (3), p.272-288
Hauptverfasser: Consens, Mariano P., Milo, Tova
Format: Artikel
Sprache:eng
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:There is a significant amount of interest in combining and extending database and information retrieval technologies to manage textual data. The challenge is becoming more relevant due to increased availability of documents in digital form. Document data has a natural hierarchical structure, which may be made explicit due to the use of markup conventions (as with SGML). An important aspect of managing structured and semistructured textual data consists of supporting the efficient retrieval of text components based both on their content and on their structure. In this paper we study issues related to the expressive power and optimization of a class of algebras that support combining string (or pattern) searches with queries on the hierarchical structure of the text. Theregion algebrastudied is a set-at-a-time algebra for manipulatingtext regions(substrings of the text) that supports finding out nesting and ordering properties of the text regions. This algebra is part of the language in use in commercial text retrieval systems and can form the basis for supporting SQL-like access to textual data. By presenting a close relationship between the region algebra and the monadic first order theory of finite binary trees, we show that queries in the algebra can be optimized, in the sense that equivalence to less expensive expressions can be tested. This optimization can be difficult (co-NP-hard in the general case), but there is an important class of queries that can be optimized in polynomial time. On the negative side, we show that the language is incapable of capturing some important properties of the text structure, related to the nesting and ordering of text regions. We conclude by suggesting possible extensions to increase the expressive power of the language and consider one such example.
ISSN:0022-0000
1090-2724
DOI:10.1006/jcss.1998.1564