Document Clustering based on Topic Maps

The importance of document clustering is now widely acknowledged by researchers for better management, smart navigation, efficient filtering, and concise summarization of large document collections such as the World Wide Web (WWW). The next challenge lies in semantically performing clustering based on the s...

Detailed description

Bibliographic details
Published in:	arXiv.org 2011-12
Main authors:	Rafi, Muhammad; M Shahid Shaikh; Farooq, Amir
Format:	Article
Language:	eng
Subjects:	Algorithms; Clustering; Computer Science - Artificial Intelligence; Computer Science - Information Retrieval; Decision support systems; Information retrieval; Mathematical models; Representations; Semantics; Similarity; Similarity measures; Suffix trees
Online access:	Full text

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Rafi, Muhammad; M Shahid Shaikh; Farooq, Amir
description	The importance of document clustering is now widely acknowledged by researchers for better management, smart navigation, efficient filtering, and concise summarization of large document collections such as the World Wide Web (WWW). The next challenge lies in clustering documents according to their semantic content. The problem of document clustering has two main components: (1) representing the document in a form that inherently captures the semantics of the text, which may also help to reduce the dimensionality of the document; and (2) defining a similarity measure on this semantic representation that assigns higher numerical values to document pairs with a stronger semantic relationship. The feature space of the documents can be very challenging for document clustering: a document may contain multiple topics, a large set of class-independent general words, and only a handful of class-specific core words. With these features in mind, traditional agglomerative clustering algorithms, which are based on either the Document Vector model (DVM) or the Suffix Tree model (STC), are less efficient at producing results with high cluster quality. This paper introduces a new approach to document clustering based on the Topic Map representation of the documents. Each document is transformed into a compact form, and a similarity measure is proposed based on the information inferred from topic map data and structures. The suggested method is implemented using agglomerative hierarchical clustering and tested on standard information retrieval (IR) datasets. The comparative experiment shows that the proposed approach is effective in improving cluster quality.
doi_str_mv	10.48550/arxiv.1112.6219
format	Article
fullrecord	<record><control><sourceid>proquest_arxiv</sourceid><recordid>TN_cdi_arxiv_primary_1112_6219</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2086910486</sourcerecordid><originalsourceid>FETCH-LOGICAL-a516-17fdffb2dc66ffd59ab6962c4963b9e9b2f52fc27679926a6a88f230c38b77c13</originalsourceid><addsrcrecordid>eNotj89LwzAYhoMgbMzdd5KCB0-tyZfmS3KU-hMmXnovSZpIx9bWpBX9792cp_fy8PI8hGwYLUolBL0z8bv7KhhjUCAwfUGWwDnLVQmwIOuUdpRSQAlC8CW5fRjcfPD9lFX7OU0-dv1HZk3ybTb0WT2MncvezJiuyGUw--TX_7si9dNjXb3k2_fn1-p-mxvBMGcytCFYaB1iCK3QxqJGcKVGbrXXFoKA4ECi1BrQoFEqAKeOKyulY3xFrs-3fxHNGLuDiT_NKaY5xRyBmzMwxuFz9mlqdsMc-6NSA1ShZrRUyH8B9nJK4w</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2086910486</pqid></control><display><type>article</type><title>Document Clustering based on Topic Maps</title><source>arXiv.org</source><source>Free E- Journals</source><creator>Rafi, Muhammad ; M Shahid Shaikh ; Farooq, Amir</creator><creatorcontrib>Rafi, Muhammad ; M Shahid Shaikh ; Farooq, Amir</creatorcontrib><description>Importance of document clustering is now widely acknowledged by researchers for better management, smart navigation, efficient filtering, and concise summarization of large collection of documents like World Wide Web (WWW). The next challenge lies in semantically performing clustering based on the semantic contents of the document. The problem of document clustering has two main components: (1) to represent the document in such a form that inherently captures semantics of the text. This may also help to reduce dimensionality of the document, and (2) to define a similarity measure based on the semantic representation such that it assigns higher numerical values to document pairs which have higher semantic relationship. Feature space of the documents can be very challenging for document clustering. 
A document may contain multiple topics, it may contain a large set of class-independent general-words, and a handful class-specific core-words. With these features in mind, traditional agglomerative clustering algorithms, which are based on either Document Vector model (DVM) or Suffix Tree model (STC), are less efficient in producing results with high cluster quality. This paper introduces a new approach for document clustering based on the Topic Map representation of the documents. The document is being transformed into a compact form. A similarity measure is proposed based upon the inferred information through topic maps data and structures. The suggested method is implemented using agglomerative hierarchal clustering and tested on standard Information retrieval (IR) datasets. The comparative experiment reveals that the proposed approach is effective in improving the cluster quality.</description><identifier>EISSN: 2331-8422</identifier><identifier>DOI: 10.48550/arxiv.1112.6219</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Algorithms ; Clustering ; Computer Science - Artificial Intelligence ; Computer Science - Information Retrieval ; Decision support systems ; Information retrieval ; Mathematical models ; Representations ; Semantics ; Similarity ; Similarity measures ; Suffix trees</subject><ispartof>arXiv.org, 2011-12</ispartof><rights>2011. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,776,780,881,27904</link.rule.ids><backlink>$$Uhttps://doi.org/10.48550/arXiv.1112.6219$$DView paper in arXiv$$Hfree_for_read</backlink><backlink>$$Uhttps://doi.org/10.5120/1640-2204$$DView published paper (Access to full text may be restricted)$$Hfree_for_read</backlink></links><search><creatorcontrib>Rafi, Muhammad</creatorcontrib><creatorcontrib>M Shahid Shaikh</creatorcontrib><creatorcontrib>Farooq, Amir</creatorcontrib><title>Document Clustering based on Topic Maps</title><title>arXiv.org</title><description>Importance of document clustering is now widely acknowledged by researchers for better management, smart navigation, efficient filtering, and concise summarization of large collection of documents like World Wide Web (WWW). The next challenge lies in semantically performing clustering based on the semantic contents of the document. The problem of document clustering has two main components: (1) to represent the document in such a form that inherently captures semantics of the text. This may also help to reduce dimensionality of the document, and (2) to define a similarity measure based on the semantic representation such that it assigns higher numerical values to document pairs which have higher semantic relationship. Feature space of the documents can be very challenging for document clustering. A document may contain multiple topics, it may contain a large set of class-independent general-words, and a handful class-specific core-words. 
With these features in mind, traditional agglomerative clustering algorithms, which are based on either Document Vector model (DVM) or Suffix Tree model (STC), are less efficient in producing results with high cluster quality. This paper introduces a new approach for document clustering based on the Topic Map representation of the documents. The document is being transformed into a compact form. A similarity measure is proposed based upon the inferred information through topic maps data and structures. The suggested method is implemented using agglomerative hierarchal clustering and tested on standard Information retrieval (IR) datasets. The comparative experiment reveals that the proposed approach is effective in improving the cluster quality.</description><subject>Algorithms</subject><subject>Clustering</subject><subject>Computer Science - Artificial Intelligence</subject><subject>Computer Science - Information Retrieval</subject><subject>Decision support systems</subject><subject>Information retrieval</subject><subject>Mathematical models</subject><subject>Representations</subject><subject>Semantics</subject><subject>Similarity</subject><subject>Similarity measures</subject><subject>Suffix trees</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2011</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GOX</sourceid><recordid>eNotj89LwzAYhoMgbMzdd5KCB0-tyZfmS3KU-hMmXnovSZpIx9bWpBX9792cp_fy8PI8hGwYLUolBL0z8bv7KhhjUCAwfUGWwDnLVQmwIOuUdpRSQAlC8CW5fRjcfPD9lFX7OU0-dv1HZk3ybTb0WT2MncvezJiuyGUw--TX_7si9dNjXb3k2_fn1-p-mxvBMGcytCFYaB1iCK3QxqJGcKVGbrXXFoKA4ECi1BrQoFEqAKeOKyulY3xFrs-3fxHNGLuDiT_NKaY5xRyBmzMwxuFz9mlqdsMc-6NSA1ShZrRUyH8B9nJK4w</recordid><startdate>20111229</startdate><enddate>20111229</enddate><creator>Rafi, Muhammad</creator><creator>M Shahid 
Shaikh</creator><creator>Farooq, Amir</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20111229</creationdate><title>Document Clustering based on Topic Maps</title><author>Rafi, Muhammad ; M Shahid Shaikh ; Farooq, Amir</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a516-17fdffb2dc66ffd59ab6962c4963b9e9b2f52fc27679926a6a88f230c38b77c13</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2011</creationdate><topic>Algorithms</topic><topic>Clustering</topic><topic>Computer Science - Artificial Intelligence</topic><topic>Computer Science - Information Retrieval</topic><topic>Decision support systems</topic><topic>Information retrieval</topic><topic>Mathematical models</topic><topic>Representations</topic><topic>Semantics</topic><topic>Similarity</topic><topic>Similarity measures</topic><topic>Suffix trees</topic><toplevel>online_resources</toplevel><creatorcontrib>Rafi, Muhammad</creatorcontrib><creatorcontrib>M Shahid Shaikh</creatorcontrib><creatorcontrib>Farooq, Amir</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Database (Proquest)</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Databases</collection><collection>Technology Collection</collection><collection>ProQuest One Community 
College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering collection</collection><collection>arXiv Computer Science</collection><collection>arXiv.org</collection><jtitle>arXiv.org</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Rafi, Muhammad</au><au>M Shahid Shaikh</au><au>Farooq, Amir</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Document Clustering based on Topic Maps</atitle><jtitle>arXiv.org</jtitle><date>2011-12-29</date><risdate>2011</risdate><eissn>2331-8422</eissn><abstract>Importance of document clustering is now widely acknowledged by researchers for better management, smart navigation, efficient filtering, and concise summarization of large collection of documents like World Wide Web (WWW). The next challenge lies in semantically performing clustering based on the semantic contents of the document. The problem of document clustering has two main components: (1) to represent the document in such a form that inherently captures semantics of the text. This may also help to reduce dimensionality of the document, and (2) to define a similarity measure based on the semantic representation such that it assigns higher numerical values to document pairs which have higher semantic relationship. Feature space of the documents can be very challenging for document clustering. 
A document may contain multiple topics, it may contain a large set of class-independent general-words, and a handful class-specific core-words. With these features in mind, traditional agglomerative clustering algorithms, which are based on either Document Vector model (DVM) or Suffix Tree model (STC), are less efficient in producing results with high cluster quality. This paper introduces a new approach for document clustering based on the Topic Map representation of the documents. The document is being transformed into a compact form. A similarity measure is proposed based upon the inferred information through topic maps data and structures. The suggested method is implemented using agglomerative hierarchal clustering and tested on standard Information retrieval (IR) datasets. The comparative experiment reveals that the proposed approach is effective in improving the cluster quality.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><doi>10.48550/arxiv.1112.6219</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2011-12
issn	2331-8422
language	eng
recordid	cdi_arxiv_primary_1112_6219
source	arXiv.org; Free E- Journals
subjects	Algorithms; Clustering; Computer Science - Artificial Intelligence; Computer Science - Information Retrieval; Decision support systems; Information retrieval; Mathematical models; Representations; Semantics; Similarity; Similarity measures; Suffix trees
title	Document Clustering based on Topic Maps
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-24T18%3A36%3A40IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_arxiv&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Document%20Clustering%20based%20on%20Topic%20Maps&rft.jtitle=arXiv.org&rft.au=Rafi,%20Muhammad&rft.date=2011-12-29&rft.eissn=2331-8422&rft_id=info:doi/10.48550/arxiv.1112.6219&rft_dat=%3Cproquest_arxiv%3E2086910486%3C/proquest_arxiv%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2086910486&rft_id=info:pmid/&rfr_iscdi=true
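
The description in this record names two steps: a compact semantic representation of each document, and agglomerative hierarchical clustering driven by a similarity measure on that representation. The record does not give the paper's actual similarity measure, so the following is only a minimal sketch under stated assumptions: each document is reduced to a plain set of topic names (a hypothetical stand-in for a topic map), similarity is Jaccard overlap (swapped in for the paper's measure), and clusters are merged by average linkage. The names `jaccard`, `average_link`, `agglomerate`, and the sample data are all illustrative.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two topic sets: |a & b| / |a | b| (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_link(c1, c2, docs):
    """Mean pairwise similarity between the documents of two clusters."""
    total = sum(jaccard(docs[i], docs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(docs, k):
    """Merge the two most similar clusters until k clusters remain (assumes k >= 1)."""
    clusters = [[name] for name in docs]  # start with one cluster per document
    while len(clusters) > k:
        # Pick the pair of cluster indices (a < b) with the highest average similarity.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_link(clusters[p[0]], clusters[p[1]], docs),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected by the deletion
    return [sorted(c) for c in clusters]


# Hypothetical topic sets standing in for topic-map representations of documents.
docs = {
    "d1": {"clustering", "similarity", "semantics"},
    "d2": {"clustering", "similarity", "suffix trees"},
    "d3": {"football", "league", "goal"},
    "d4": {"football", "goal", "transfer"},
}
result = agglomerate(docs, 2)
print(result)  # [['d1', 'd2'], ['d3', 'd4']]
```

With this toy data the two documents about clustering end up in one cluster and the two about football in the other. A measure that also uses the associations and structure of a real topic map, as the description suggests, would replace `jaccard` without changing the merging loop.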