On ontology-driven document clustering using core semantic features

Incorporating semantic knowledge from an ontology into document clustering is an important but challenging problem. While numerous methods have been developed, the value of using such an ontology is still not clear. We show in this paper that an ontology can be used to greatly reduce the number of f...

Bibliographic details
Published in: Knowledge and information systems, 2011-08, Vol. 28 (2), p. 395-421
Main authors: Fodeh, Samah; Punch, Bill; Tan, Pang-Ning
Format: Article
Language: English
Online access: Full text
container_end_page 421
container_issue 2
container_start_page 395
container_title Knowledge and information systems
container_volume 28
creator Fodeh, Samah
Punch, Bill
Tan, Pang-Ning
description Incorporating semantic knowledge from an ontology into document clustering is an important but challenging problem. While numerous methods have been developed, the value of using such an ontology is still not clear. We show in this paper that an ontology can be used to greatly reduce the number of features needed to do document clustering. Our hypothesis is that polysemous and synonymous nouns are both relatively prevalent and fundamentally important for document cluster formation. We show that nouns can be efficiently identified in documents and that this alone provides improved clustering. We next show the importance of the polysemous and synonymous nouns in clustering and develop a unique approach that allows us to measure the information gain in disambiguating these nouns in an unsupervised learning setting. In so doing, we can identify a core subset of semantic features that represent a text corpus. Empirical results show that by using core semantic features for clustering, one can reduce the number of features by 90% or more and still produce clusters that capture the main themes in a text corpus.
doi_str_mv 10.1007/s10115-010-0370-4
format Article
fullrecord publisher: London: Springer-Verlag
rights: Springer-Verlag London Limited 2011
abbreviated title: Knowl Inf Syst
CODEN: KISNCR
EISSN: 0219-3116
publication date: 2011-08-01
total pages: 27
peer reviewed: yes
fulltext fulltext
identifier ISSN: 0219-1377
ispartof Knowledge and information systems, 2011-08, Vol.28 (2), p.395-421
issn 0219-1377
0219-3116
language eng
recordid cdi_proquest_miscellaneous_914620609
source SpringerLink Journals - AutoHoldings
subjects Algorithms
Cluster analysis
Clustering
Clusters
Computer Science
Computer viruses
Data Mining and Knowledge Discovery
Database Management
Datasets
Document management
Documents
Empirical analysis
Gain
Information Storage and Retrieval
Information systems
Information Systems and Communication Service
Information Systems Applications (incl.Internet)
IT in Business
Learning
Malware
Medicine
Ontology
Regular Paper
Semantics
Studies
Texts
title On ontology-driven document clustering using core semantic features
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-08T10%3A26%3A53IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=On%20ontology-driven%20document%20clustering%20using%20core%20semantic%20features&rft.jtitle=Knowledge%20and%20information%20systems&rft.au=Fodeh,%20Samah&rft.date=2011-08-01&rft.volume=28&rft.issue=2&rft.spage=395&rft.epage=421&rft.pages=395-421&rft.issn=0219-1377&rft.eissn=0219-3116&rft.coden=KISNCR&rft_id=info:doi/10.1007/s10115-010-0370-4&rft_dat=%3Cproquest_cross%3E2420205101%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=881698225&rft_id=info:pmid/&rfr_iscdi=true