Utilizing global and path information with language modelling for hierarchical text classification

Bibliographic Details
Published in:	Journal of Information Science, 2014-04, Vol. 40 (2), p. 127-145
Main authors:	Oh, Heung-Seon; Myaeng, Sung-Hyon
Format:	Article
Language:	eng
Subjects:	Categories; Classification; Classifiers; Construction; Effectiveness studies; Exact sciences and technology; General aspects; Information and communication sciences; Information science; Information science. Documentation; Mathematical models; Modelling; Sciences and techniques of general use; Taxonomy; Text; Text categorization; Texts; Vocabularies & taxonomies; Web sites
Online access:	Full text

Description
Summary:	Hierarchical text classification of a Web taxonomy is challenging because it is a very large-scale problem with hundreds of thousands of categories and associated documents. Furthermore, the conceptual level of the categories and the amount of training data available for each vary widely. The narrow-down approach is the state of the art; it uses a search engine to generate candidate categories from the taxonomy and builds a classifier for the final category selection. In this paper, we take the same approach but address the issue of using global information in a language modelling framework to improve effectiveness. We propose three methods of using non-local information for the task: a passive way of utilizing global information for smoothing; an aggressive way where a top-level classifier is built and integrated with a local model; and a method of using label terms associated with the path from a category to the root, which is based on our systematic observation that they are underrepresented in the documents. For evaluation, we constructed a document collection from Web pages in the Open Directory Project. A series of experiments and their results show the superiority of our methods and reveal the role of global information in hierarchical text classification.
ISSN:	0165-5515, 1741-6485
DOI:	10.1177/0165551513507415
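
The summary describes a "passive" use of global information: smoothing a local category language model with a global one, then choosing among candidate categories. The record does not give the paper's formulas, so the following is only a minimal sketch of one common way this could look, assuming unigram models and linear interpolation (Jelinek-Mercer smoothing). The function names, the weight `lam`, and the toy categories are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def unigram_counts(docs):
    """Count term occurrences over a list of tokenized documents."""
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return counts


def smoothed_log_likelihood(doc_tokens, local_counts, global_counts, lam=0.7):
    """Log-likelihood of a document under a local unigram model that is
    linearly interpolated with a global model (assumed smoothing scheme)."""
    local_total = sum(local_counts.values())
    global_total = sum(global_counts.values())
    score = 0.0
    for term in doc_tokens:
        p_local = local_counts[term] / local_total if local_total else 0.0
        p_global = global_counts[term] / global_total if global_total else 0.0
        p = lam * p_local + (1.0 - lam) * p_global
        if p == 0.0:
            # Term unseen in every model: skipped for all categories alike
            # (an assumption made to keep the sketch short).
            continue
        score += math.log(p)
    return score


def classify(doc_tokens, candidates, global_counts, lam=0.7):
    """Return the candidate category with the highest smoothed score.
    `candidates` maps a category name to its local term counts; in the
    narrow-down approach these would come from a search step."""
    return max(
        candidates,
        key=lambda c: smoothed_log_likelihood(
            doc_tokens, candidates[c], global_counts, lam
        ),
    )


# Toy data (hypothetical): two candidate categories with a few training docs.
training = {
    "Sports/Soccer": [["goal", "league", "match"], ["match", "striker"]],
    "Computers/Programming": [["python", "code", "compiler"], ["code", "bug"]],
}
local_models = {cat: unigram_counts(docs) for cat, docs in training.items()}

# The global model pools the counts of all categories.
global_model = Counter()
for counts in local_models.values():
    global_model.update(counts)

best = classify(["code", "match", "bug"], local_models, global_model)
```

In this sketch the global model gives nonzero probability to terms a sparse local model has never seen, so a category with little training data is not ruled out by a single unseen term. The other two methods in the summary (the top-level classifier and the path label terms) are not covered here.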