Clustering Heterogeneous Web Data using Clustering by Compression. Cluster Validity

The expansive nature of the Internet produced a vast quantity of unstructured data, compared to our conception of a conventional data base. The application of clustering on the World Wide Web is essential to get structured information from this sea of information. In this paper, we intend to test th...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Hauptverfasser: Cernian, A., Carstoiu, D., Olteanu, A.
Format: Tagungsbericht
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:The expansive nature of the Internet produced a vast quantity of unstructured data, compared to our conception of a conventional data base. The application of clustering on the World Wide Web is essential to get structured information from this sea of information. In this paper, we intend to test the results of a new clustering technique - clustering by compression - when applied to heterogeneous sets of data. The clustering by compression procedure is based on a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pair-wise concatenation). In order to validate the results, we calculate some quality indices. If the values we obtain prove a high quality of the clustering, in the near future we plan to include the clustering by compression technique into a framework for clustering heterogeneous Web objects.
DOI:10.1109/SYNASC.2008.64