Taxonomy generation for text segments: A practical web-based approach

It is crucial in many information systems to organize short text segments, such as keywords in documents and queries from users, into a well-formed taxonomy. In this article, we address the problem of taxonomy generation for diverse text segments with a general and practical approach that uses the W...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:ACM transactions on information systems 2005-10, Vol.23 (4), p.363-396
Hauptverfasser: Chuang, Shui-Lung, Chien, Lee-Feng
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:It is crucial in many information systems to organize short text segments, such as keywords in documents and queries from users, into a well-formed taxonomy. In this article, we address the problem of taxonomy generation for diverse text segments with a general and practical approach that uses the Web as an additional knowledge source. Unlike long documents, short text segments typically do not contain enough information to extract reliable features. This work investigates the possibilities of using highly ranked search-result snippets to enrich the representation of text segments. A hierarchical clustering algorithm is then designed for creating the hierarchical topic structure of text segments. Text segments with close concepts can be grouped together in a cluster, and relevant clusters linked at the same or near levels. Different from traditional clustering algorithms, which tend to produce cluster hierarchies with a very unnatural shape, the algorithm tries to produce a more natural and comprehensive tree hierarchy. Extensive experiments were conducted on different domains of text segments, including subject terms, people names, paper titles, and natural language questions. The obtained experimental results have shown the potential of the proposed approach, which provides a basis for the in-depth analysis of text segments on a larger scale and is believed able to benefit many information systems.
ISSN:1046-8188
1558-2868
DOI:10.1145/1095872.1095873