On ontology-driven document clustering using core semantic features

Incorporating semantic knowledge from an ontology into document clustering is an important but challenging problem. While numerous methods have been developed, the value of using such an ontology is still not clear. We show in this paper that an ontology can be used to greatly reduce the number of f...

Detailed description

Bibliographic details
Published in:	Knowledge and information systems, 2011-08, Vol.28 (2), p.395-421
Main authors:	Fodeh, Samah; Punch, Bill; Tan, Pang-Ning
Format:	Article
Language:	eng
Subjects:	Algorithms; Cluster analysis; Clustering; Clusters; Computer Science; Computer viruses; Data Mining and Knowledge Discovery; Database Management; Datasets; Document management; Documents; Empirical analysis; Gain; Information Storage and Retrieval; Information systems; Information Systems and Communication Service; Information Systems Applications (incl.Internet); IT in Business; Learning; Malware; Medicine; Ontology; Regular Paper; Semantics; Studies; Texts
Online access:	Full text

container_end_page	421
container_issue	2
container_start_page	395
container_title	Knowledge and information systems
container_volume	28
creator	Fodeh, Samah; Punch, Bill; Tan, Pang-Ning
description	Incorporating semantic knowledge from an ontology into document clustering is an important but challenging problem. While numerous methods have been developed, the value of using such an ontology is still not clear. We show in this paper that an ontology can be used to greatly reduce the number of features needed to do document clustering. Our hypothesis is that polysemous and synonymous nouns are both relatively prevalent and fundamentally important for document cluster formation. We show that nouns can be efficiently identified in documents and that this alone provides improved clustering. We next show the importance of the polysemous and synonymous nouns in clustering and develop a unique approach that allows us to measure the information gain in disambiguating these nouns in an unsupervised learning setting. In so doing, we can identify a core subset of semantic features that represent a text corpus. Empirical results show that by using core semantic features for clustering, one can reduce the number of features by 90% or more and still produce clusters that capture the main themes in a text corpus.
doi_str_mv	10.1007/s10115-010-0370-4
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_914620609</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2420205101</sourcerecordid><originalsourceid>FETCH-LOGICAL-c347t-dd250efca33acd2a7e1b9ef5f8f16bc494a2ded33506ddae583bbd884b1110283</originalsourceid><addsrcrecordid>eNp1kE1LxDAQhoMouK7-AG_Fi6foTNKP9CiLX7CwFz2HNJkuXdpmTVph_71duiAIXmbm8Lwvw8PYLcIDAhSPEQEx44DAQRbA0zO2AIEll4j5-elGWRSX7CrGHQAWOeKCrTZ94vvBt3574C4039Qnztuxo35IbDvGgULTb5MxHqf1gZJInemHxiY1mWEMFK_ZRW3aSDenvWSfL88fqze-3ry-r57W3Mq0GLhzIgOqrZHSWCdMQViVVGe1qjGvbFqmRjhyUmaQO2coU7KqnFJphYgglFyy-7l3H_zXSHHQXRMtta3pyY9Rl5jmAnIoJ_LuD7nzY-in57RSmJdKiGyCcIZs8DEGqvU-NJ0JB42gj1L1LFVPUvVRqk6njJgzcX_UQuG3-P_QD8_Xelg</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>881698225</pqid></control><display><type>article</type><title>On ontology-driven document clustering using core semantic features</title><source>SpringerLink Journals - AutoHoldings</source><creator>Fodeh, Samah ; Punch, Bill ; Tan, Pang-Ning</creator><creatorcontrib>Fodeh, Samah ; Punch, Bill ; Tan, Pang-Ning</creatorcontrib><description>Incorporating semantic knowledge from an ontology into document clustering is an important but challenging problem. While numerous methods have been developed, the value of using such an ontology is still not clear. We show in this paper that an ontology can be used to greatly reduce the number of features needed to do document clustering. Our hypothesis is that polysemous and synonymous nouns are both relatively prevalent and fundamentally important for document cluster formation. We show that nouns can be efficiently identified in documents and that this alone provides improved clustering. 
We next show the importance of the polysemous and synonymous nouns in clustering and develop a unique approach that allows us to measure the information gain in disambiguating these nouns in an unsupervised learning setting. In so doing, we can identify a core subset of semantic features that represent a text corpus. Empirical results show that by using core semantic features for clustering, one can reduce the number of features by 90% or more and still produce clusters that capture the main themes in a text corpus.</description><identifier>ISSN: 0219-1377</identifier><identifier>EISSN: 0219-3116</identifier><identifier>DOI: 10.1007/s10115-010-0370-4</identifier><identifier>CODEN: KISNCR</identifier><language>eng</language><publisher>London: Springer-Verlag</publisher><subject>Algorithms ; Cluster analysis ; Clustering ; Clusters ; Computer Science ; Computer viruses ; Data Mining and Knowledge Discovery ; Database Management ; Datasets ; Document management ; Documents ; Empirical analysis ; Gain ; Information Storage and Retrieval ; Information systems ; Information Systems and Communication Service ; Information Systems Applications (incl.Internet) ; IT in Business ; Learning ; Malware ; Medicine ; Ontology ; Regular Paper ; Semantics ; Studies ; Texts</subject><ispartof>Knowledge and information systems, 2011-08, Vol.28 (2), p.395-421</ispartof><rights>Springer-Verlag London Limited 
2011</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c347t-dd250efca33acd2a7e1b9ef5f8f16bc494a2ded33506ddae583bbd884b1110283</citedby><cites>FETCH-LOGICAL-c347t-dd250efca33acd2a7e1b9ef5f8f16bc494a2ded33506ddae583bbd884b1110283</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://link.springer.com/content/pdf/10.1007/s10115-010-0370-4$$EPDF$$P50$$Gspringer$$H</linktopdf><linktohtml>$$Uhttps://link.springer.com/10.1007/s10115-010-0370-4$$EHTML$$P50$$Gspringer$$H</linktohtml><link.rule.ids>314,780,784,27923,27924,41487,42556,51318</link.rule.ids></links><search><creatorcontrib>Fodeh, Samah</creatorcontrib><creatorcontrib>Punch, Bill</creatorcontrib><creatorcontrib>Tan, Pang-Ning</creatorcontrib><title>On ontology-driven document clustering using core semantic features</title><title>Knowledge and information systems</title><addtitle>Knowl Inf Syst</addtitle><description>Incorporating semantic knowledge from an ontology into document clustering is an important but challenging problem. While numerous methods have been developed, the value of using such an ontology is still not clear. We show in this paper that an ontology can be used to greatly reduce the number of features needed to do document clustering. Our hypothesis is that polysemous and synonymous nouns are both relatively prevalent and fundamentally important for document cluster formation. We show that nouns can be efficiently identified in documents and that this alone provides improved clustering. We next show the importance of the polysemous and synonymous nouns in clustering and develop a unique approach that allows us to measure the information gain in disambiguating these nouns in an unsupervised learning setting. 
In so doing, we can identify a core subset of semantic features that represent a text corpus. Empirical results show that by using core semantic features for clustering, one can reduce the number of features by 90% or more and still produce clusters that capture the main themes in a text corpus.</description><subject>Algorithms</subject><subject>Cluster analysis</subject><subject>Clustering</subject><subject>Clusters</subject><subject>Computer Science</subject><subject>Computer viruses</subject><subject>Data Mining and Knowledge Discovery</subject><subject>Database Management</subject><subject>Datasets</subject><subject>Document management</subject><subject>Documents</subject><subject>Empirical analysis</subject><subject>Gain</subject><subject>Information Storage and Retrieval</subject><subject>Information systems</subject><subject>Information Systems and Communication Service</subject><subject>Information Systems Applications (incl.Internet)</subject><subject>IT in Business</subject><subject>Learning</subject><subject>Malware</subject><subject>Medicine</subject><subject>Ontology</subject><subject>Regular 
Paper</subject><subject>Semantics</subject><subject>Studies</subject><subject>Texts</subject><issn>0219-1377</issn><issn>0219-3116</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2011</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><recordid>eNp1kE1LxDAQhoMouK7-AG_Fi6foTNKP9CiLX7CwFz2HNJkuXdpmTVph_71duiAIXmbm8Lwvw8PYLcIDAhSPEQEx44DAQRbA0zO2AIEll4j5-elGWRSX7CrGHQAWOeKCrTZ94vvBt3574C4039Qnztuxo35IbDvGgULTb5MxHqf1gZJInemHxiY1mWEMFK_ZRW3aSDenvWSfL88fqze-3ry-r57W3Mq0GLhzIgOqrZHSWCdMQViVVGe1qjGvbFqmRjhyUmaQO2coU7KqnFJphYgglFyy-7l3H_zXSHHQXRMtta3pyY9Rl5jmAnIoJ_LuD7nzY-in57RSmJdKiGyCcIZs8DEGqvU-NJ0JB42gj1L1LFVPUvVRqk6njJgzcX_UQuG3-P_QD8_Xelg</recordid><startdate>20110801</startdate><enddate>20110801</enddate><creator>Fodeh, Samah</creator><creator>Punch, Bill</creator><creator>Tan, Pang-Ning</creator><general>Springer-Verlag</general><general>Springer Nature 
B.V</general><scope>AAYXX</scope><scope>CITATION</scope><scope>0U~</scope><scope>1-H</scope><scope>3V.</scope><scope>7SC</scope><scope>7WY</scope><scope>7WZ</scope><scope>7XB</scope><scope>87Z</scope><scope>8AL</scope><scope>8AO</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>8FK</scope><scope>8FL</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BEZIV</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FRNLG</scope><scope>F~G</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K60</scope><scope>K6~</scope><scope>K7-</scope><scope>L.-</scope><scope>L.0</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>M0C</scope><scope>M0N</scope><scope>P5Z</scope><scope>P62</scope><scope>PQBIZ</scope><scope>PQBZA</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>Q9U</scope></search><sort><creationdate>20110801</creationdate><title>On ontology-driven document clustering using core semantic features</title><author>Fodeh, Samah ; Punch, Bill ; Tan, Pang-Ning</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c347t-dd250efca33acd2a7e1b9ef5f8f16bc494a2ded33506ddae583bbd884b1110283</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2011</creationdate><topic>Algorithms</topic><topic>Cluster analysis</topic><topic>Clustering</topic><topic>Clusters</topic><topic>Computer Science</topic><topic>Computer viruses</topic><topic>Data Mining and Knowledge Discovery</topic><topic>Database Management</topic><topic>Datasets</topic><topic>Document management</topic><topic>Documents</topic><topic>Empirical analysis</topic><topic>Gain</topic><topic>Information Storage and Retrieval</topic><topic>Information systems</topic><topic>Information Systems and Communication Service</topic><topic>Information Systems Applications (incl.Internet)</topic><topic>IT 
in Business</topic><topic>Learning</topic><topic>Malware</topic><topic>Medicine</topic><topic>Ontology</topic><topic>Regular Paper</topic><topic>Semantics</topic><topic>Studies</topic><topic>Texts</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Fodeh, Samah</creatorcontrib><creatorcontrib>Punch, Bill</creatorcontrib><creatorcontrib>Tan, Pang-Ning</creatorcontrib><collection>CrossRef</collection><collection>Global News & ABI/Inform Professional</collection><collection>Trade PRO</collection><collection>ProQuest Central (Corporate)</collection><collection>Computer and Information Systems Abstracts</collection><collection>ABI/INFORM Collection</collection><collection>ABI/INFORM Global (PDF only)</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>ABI/INFORM Global (Alumni Edition)</collection><collection>Computing Database (Alumni Edition)</collection><collection>ProQuest Pharma Collection</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>ABI/INFORM Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies & Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Business Premium Collection</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>Business Premium Collection (Alumni)</collection><collection>ABI/INFORM Global (Corporate)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium 
Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>ProQuest Business Collection (Alumni Edition)</collection><collection>ProQuest Business Collection</collection><collection>Computer Science Database</collection><collection>ABI/INFORM Professional Advanced</collection><collection>ABI/INFORM Professional Standard</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>ABI/INFORM Global</collection><collection>Computing Database</collection><collection>Advanced Technologies & Aerospace Database</collection><collection>ProQuest Advanced Technologies & Aerospace Collection</collection><collection>ProQuest One Business</collection><collection>ProQuest One Business (Alumni)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central Basic</collection><jtitle>Knowledge and information systems</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Fodeh, Samah</au><au>Punch, Bill</au><au>Tan, Pang-Ning</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>On ontology-driven document clustering using core semantic features</atitle><jtitle>Knowledge and information systems</jtitle><stitle>Knowl Inf Syst</stitle><date>2011-08-01</date><risdate>2011</risdate><volume>28</volume><issue>2</issue><spage>395</spage><epage>421</epage><pages>395-421</pages><issn>0219-1377</issn><eissn>0219-3116</eissn><coden>KISNCR</coden><abstract>Incorporating semantic knowledge from an ontology into document clustering is an important but challenging problem. 
While numerous methods have been developed, the value of using such an ontology is still not clear. We show in this paper that an ontology can be used to greatly reduce the number of features needed to do document clustering. Our hypothesis is that polysemous and synonymous nouns are both relatively prevalent and fundamentally important for document cluster formation. We show that nouns can be efficiently identified in documents and that this alone provides improved clustering. We next show the importance of the polysemous and synonymous nouns in clustering and develop a unique approach that allows us to measure the information gain in disambiguating these nouns in an unsupervised learning setting. In so doing, we can identify a core subset of semantic features that represent a text corpus. Empirical results show that by using core semantic features for clustering, one can reduce the number of features by 90% or more and still produce clusters that capture the main themes in a text corpus.</abstract><cop>London</cop><pub>Springer-Verlag</pub><doi>10.1007/s10115-010-0370-4</doi><tpages>27</tpages></addata></record>
fulltext	fulltext
identifier	ISSN: 0219-1377
ispartof	Knowledge and information systems, 2011-08, Vol.28 (2), p.395-421
issn	0219-1377; 0219-3116
language	eng
recordid	cdi_proquest_miscellaneous_914620609
source	SpringerLink Journals - AutoHoldings
subjects	Algorithms; Cluster analysis; Clustering; Clusters; Computer Science; Computer viruses; Data Mining and Knowledge Discovery; Database Management; Datasets; Document management; Documents; Empirical analysis; Gain; Information Storage and Retrieval; Information systems; Information Systems and Communication Service; Information Systems Applications (incl.Internet); IT in Business; Learning; Malware; Medicine; Ontology; Regular Paper; Semantics; Studies; Texts
title	On ontology-driven document clustering using core semantic features
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-08T10%3A26%3A53IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=On%20ontology-driven%20document%20clustering%20using%20core%20semantic%20features&rft.jtitle=Knowledge%20and%20information%20systems&rft.au=Fodeh,%20Samah&rft.date=2011-08-01&rft.volume=28&rft.issue=2&rft.spage=395&rft.epage=421&rft.pages=395-421&rft.issn=0219-1377&rft.eissn=0219-3116&rft.coden=KISNCR&rft_id=info:doi/10.1007/s10115-010-0370-4&rft_dat=%3Cproquest_cross%3E2420205101%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=881698225&rft_id=info:pmid/&rfr_iscdi=true