Using heterogeneous linguistic knowledge in local coherence identification for information retrieval


Bibliographic details
Published in: Journal of information science 2000-01, Vol. 26 (5), p. 313-328
Main author: Chan, Samuel W.K.
Format: Article
Language: English
Description
Summary: This paper proposes a novel approach to automatic text segmentation that does not require full semantic understanding. To analyse the linguistic bonds in a text and determine the degree of coherence it exhibits, the wide diversity of textual relations is represented in a discourse network. A corpus of mutual linguistic knowledge that captures similarity of meaning and causal relations is encoded in the discourse network, which is then subjected to a clustering algorithm. As a result, segments of the text are grouped into clusters according to their textual similarity. Topic boundaries in a text can be identified by observing the shifts of segments from one cluster to another. The experimental results show that the combination of heterogeneous knowledge is capable of identifying the topic shifts. Comparison with a related method shows that the output of the algorithm corresponds closely to the topic boundaries. Given the increasing recognition of text structure in the field of information retrieval in unpartitioned text, this approach provides a quantitative model and an efficient tool for text segmentation.
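The pipeline outlined in the summary (relate segments, cluster them, read topic boundaries off the cluster shifts) can be illustrated with a minimal sketch. This is not the paper's method: bag-of-words cosine similarity stands in for the encoded linguistic knowledge, a thresholded connected-components grouping stands in for the unspecified clustering algorithm, and the names `cosine`, `cluster_segments`, `topic_boundaries` and the `threshold` parameter are all hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_segments(segments, threshold=0.3):
    """Group segments whose similarity reaches `threshold` (assumed value).

    Segments are nodes of a toy "network"; an edge joins two segments that
    are similar enough, and each connected component is one cluster.
    Returns one cluster label per segment.
    """
    vectors = [Counter(s.lower().split()) for s in segments]
    parent = list(range(len(segments)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(segments))]


def topic_boundaries(labels):
    """Indices where a segment falls in a different cluster than its predecessor."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]


segments = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stock markets fell sharply today",
    "markets and stock prices fell",
]
labels = cluster_segments(segments)
boundaries = topic_boundaries(labels)  # a boundary before the third segment
```

A real system would replace the word-overlap measure with richer relations (the summary mentions similarity of meaning and causal relations) and would need stop-word handling, since shared function words can link unrelated segments.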
ISSN: 0165-5515, 1741-6485
DOI: 10.1177/016555150002600504