Document Clustering for Social Problem Detection and Cluster Evaluation Measures

Bibliographic Details
Published in: Transactions of the Japanese Society for Artificial Intelligence 2009, Vol.24(4), pp.333-338
Main authors: Hashimoto, Taiichi, Murakami, Koji, Inui, Takashi, Utsumi, Kazuo, Ishikawa, Masamichi
Format: Article
Language: English
Description
Summary: Document clustering is a useful approach for macro-level analysis of large document collections. However, it is difficult for an analyst to efficiently detect the clusters that contain important information in the results of document clustering. This paper presents a method to support the analysis of social problems from newspaper articles. We define two new measures for each cluster to discover important clusters in a dendrogram generated by a hierarchical clustering algorithm. One, called "Density", is a measure of relevance among documents in a cluster and is calculated from the rate of terms shared within the cluster. The other, called "Centrality", is a measure of relevance among clusters and is calculated from the depth of the ancestor node shared by two arbitrary clusters in the dendrogram and the number of documents in the clusters. The measures extend conventional research on co-word analysis of science and technology literature. We carried out experiments to evaluate our method using Nikkei newspaper articles that describe organizational hazards caused by Japanese industries. The experimental results showed that our method efficiently provided useful information for detecting important clusters in a dendrogram.
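The summary names the inputs to the two measures but does not give their formulas, so the following Python sketch is only one plausible reading, not the authors' definitions. It assumes that Density is the share of a cluster's distinct terms that occur in at least two of its documents, and that Centrality is the mean depth of the shared ancestor node between a cluster and every other cluster, weighted by the number of documents in the other cluster. The function names, the `parent` dictionary encoding of the dendrogram, and the toy data are all hypothetical.

```python
from collections import Counter


def density(docs):
    """Assumed Density: share of distinct terms found in >= 2 documents."""
    counts = Counter(term for doc in docs for term in set(doc))
    if not counts:
        return 0.0
    shared = sum(1 for n in counts.values() if n >= 2)
    return shared / len(counts)


def ancestors(node, parent):
    """Path from a node up to the root, starting with the node itself."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def shared_depth(a, b, parent):
    """Depth (root = 0) of the deepest ancestor shared by nodes a and b."""
    path_a = ancestors(a, parent)
    seen = set(ancestors(b, parent))
    for i, node in enumerate(path_a):
        if node in seen:
            return len(path_a) - 1 - i
    raise ValueError("nodes are not in the same tree")


def centrality(cluster, sizes, parent):
    """Assumed Centrality: size-weighted mean shared-ancestor depth."""
    others = [c for c in sizes if c != cluster]
    total = sum(sizes[c] for c in others)
    if total == 0:
        return 0.0
    weighted = sum(shared_depth(cluster, c, parent) * sizes[c] for c in others)
    return weighted / total


# Hypothetical toy dendrogram: clusters A and B merge at n1, n1 and C merge at root.
parent = {"A": "n1", "B": "n1", "n1": "root", "C": "root"}
sizes = {"A": 3, "B": 2, "C": 5}  # number of documents per cluster

# Hypothetical documents of one cluster, each reduced to a set of terms.
cluster_docs = [
    {"recall", "defect", "car"},
    {"recall", "defect", "fine"},
    {"recall", "audit"},
]

print(density(cluster_docs))
print(centrality("A", sizes, parent))
```

Under these assumptions, a cluster whose documents repeat the same vocabulary scores a high Density, and a cluster that merges with large neighbours deep in the dendrogram scores a high Centrality. The paper's actual normalization may differ.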
ISSN: 1346-0714, 1346-8030
DOI: 10.1527/tjsai.24.333