A PHRASE GRAMMAR-BASED CONCEPTUAL INDEXING PARADIGM
Published in: | Applied artificial intelligence 2005-07, Vol.19 (6), p.559-599 |
---|---|
Main authors: | , |
Format: | Article |
Language: | eng |
Online access: | Full text |
Abstract: | Present-day information retrieval systems largely ignore the issues of lexical and compositional semantics, and rely mainly on statistical measures for choosing or evolving an indexing scheme. This has been the reason for the decreasing precision in their responses, given an exponentially increasing number of Web pages. The work reported in this paper addresses this issue from a linguistic point of view. We show that the detection of domain-specific phrases can capture the task-specific semantics of documents. We introduce the notion of n*-gram formalism to characterize the domain-specific phrases and their variants, taking a few sample domains. A method to construct a phrase grammar from a small set of documents is proposed. A method of conceptual indexing based on the phrase grammar is also proposed. In order to demonstrate the effectiveness of the proposed method, we have designed a versatile system that can perform concept-based retrieval, in addition to several document-processing tasks, such as text classification, extraction-based summarization, context tracking, and semantic tagging. Collectively, the system can address the semantic content of documents. Considering the fact that an average user prefers highly relevant results in the top-ranked subset to an exhaustively retrieved set, it is shown that the proposed system performs better, in that it retrieves documents that are more conceptually relevant than those retrieved by Google, at the 95% confidence level. |
---|---|
ISSN: | 0883-9514 1087-6545 |
DOI: | 10.1080/08839510590901958 |