Similarity Model and Term Association for Document Categorization

Bibliographic Details
Main authors:	Kou, Huaizhong; Gardarin, Georges
Format:	Book chapter
Language:	English
Subjects:	Axis Vector; Content analysis; Document Categorization; Exact sciences and technology; Feature Term; Indexing. Classification. Abstracting. Syntheses; Information and document structure and analysis; Information processing and retrieval; Information science. Documentation; Sciences and techniques of general use; Similarity Model; Term Vector
Online access:	Full text

Description
Summary:	In the information retrieval and document categorization context, both Euclidean distance-based and cosine-based similarity models assume that term vectors are orthogonal. This assumption does not hold, and term associations are ignored in such similarity models. This paper analyzes the properties of term-document space, term-category space and category-document space. Then, without the assumption of term independence, we propose a new mathematical model to estimate the association between terms and define an ∞-similarity model of documents. Here we make as much use as possible of the existing category membership represented by the corpus, and the objective is to improve categorization performance. The empirical results obtained with a k-NN classifier over the Reuters-21578 corpus show that using term association can improve the effectiveness of a categorization system and that the ∞-similarity model outperforms models without term association.
ISSN:	0302-9743, 1611-3349
DOI:	10.1007/3-540-36271-1_22
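
The orthogonality assumption mentioned in the summary can be illustrated with a small sketch. This is not the model proposed in the chapter, whose definition is not given in this record. It is a generic cosine similarity generalized with a term-association matrix, and the vocabulary, the matrix `G`, its values, and all function names are hypothetical choices made for illustration only.

```python
import math

def cosine(a, b):
    """Standard cosine similarity; implicitly treats terms as orthogonal axes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def bilinear(a, g, b):
    """Compute a^T G b for a term-association matrix G (list of rows)."""
    return sum(a[i] * g[i][j] * b[j]
               for i in range(len(a)) for j in range(len(b)))

def associated_cosine(a, b, g):
    """Cosine similarity under the inner product induced by G.

    Assumption: G is symmetric positive semidefinite, with G[i][j]
    expressing how strongly terms i and j are associated.
    """
    denom = math.sqrt(bilinear(a, g, a) * bilinear(b, g, b))
    return bilinear(a, g, b) / denom if denom > 0 else 0.0

# Hypothetical toy vocabulary: ["car", "automobile", "banana"]
doc_a = [1.0, 0.0, 0.0]   # mentions only "car"
doc_b = [0.0, 1.0, 0.0]   # mentions only "automobile"

# Hypothetical association matrix: "car" and "automobile" are related (0.8).
G = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# With the identity matrix, the generalized form reduces to plain cosine.
IDENTITY = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

plain = cosine(doc_a, doc_b)
with_association = associated_cosine(doc_a, doc_b, G)
```

In this toy setup `plain` is 0.0 because the two documents share no term, while `with_association` is 0.8 because the assumed matrix records that the two terms are related. How such associations should be estimated from a categorized corpus is the subject of the chapter itself and is not reproduced here.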