On ontology-driven document clustering using core semantic features
Incorporating semantic knowledge from an ontology into document clustering is an important but challenging problem. While numerous methods have been developed, the value of using such an ontology is still not clear. We show in this paper that an ontology can be used to greatly reduce the number of f...
Published in: | Knowledge and information systems 2011-08, Vol.28 (2), p.395-421 |
---|---|
Main authors: | Fodeh, Samah ; Punch, Bill ; Tan, Pang-Ning |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 421 |
---|---|
container_issue | 2 |
container_start_page | 395 |
container_title | Knowledge and information systems |
container_volume | 28 |
creator | Fodeh, Samah ; Punch, Bill ; Tan, Pang-Ning |
description | Incorporating semantic knowledge from an ontology into document clustering is an important but challenging problem. While numerous methods have been developed, the value of using such an ontology is still not clear. We show in this paper that an ontology can be used to greatly reduce the number of features needed to do document clustering. Our hypothesis is that polysemous and synonymous nouns are both relatively prevalent and fundamentally important for document cluster formation. We show that nouns can be efficiently identified in documents and that this alone provides improved clustering. We next show the importance of the polysemous and synonymous nouns in clustering and develop a unique approach that allows us to measure the information gain in disambiguating these nouns in an unsupervised learning setting. In so doing, we can identify a core subset of semantic features that represent a text corpus. Empirical results show that by using core semantic features for clustering, one can reduce the number of features by 90% or more and still produce clusters that capture the main themes in a text corpus. |
doi_str_mv | 10.1007/s10115-010-0370-4 |
format | Article |
fullrecord | <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_914620609</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2420205101</sourcerecordid><originalsourceid>FETCH-LOGICAL-c347t-dd250efca33acd2a7e1b9ef5f8f16bc494a2ded33506ddae583bbd884b1110283</originalsourceid><addsrcrecordid>eNp1kE1LxDAQhoMouK7-AG_Fi6foTNKP9CiLX7CwFz2HNJkuXdpmTVph_71duiAIXmbm8Lwvw8PYLcIDAhSPEQEx44DAQRbA0zO2AIEll4j5-elGWRSX7CrGHQAWOeKCrTZ94vvBt3574C4039Qnztuxo35IbDvGgULTb5MxHqf1gZJInemHxiY1mWEMFK_ZRW3aSDenvWSfL88fqze-3ry-r57W3Mq0GLhzIgOqrZHSWCdMQViVVGe1qjGvbFqmRjhyUmaQO2coU7KqnFJphYgglFyy-7l3H_zXSHHQXRMtta3pyY9Rl5jmAnIoJ_LuD7nzY-in57RSmJdKiGyCcIZs8DEGqvU-NJ0JB42gj1L1LFVPUvVRqk6njJgzcX_UQuG3-P_QD8_Xelg</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>881698225</pqid></control><display><type>article</type><title>On ontology-driven document clustering using core semantic features</title><source>SpringerLink Journals - AutoHoldings</source><creator>Fodeh, Samah ; Punch, Bill ; Tan, Pang-Ning</creator><creatorcontrib>Fodeh, Samah ; Punch, Bill ; Tan, Pang-Ning</creatorcontrib><description>Incorporating semantic knowledge from an ontology into document clustering is an important but challenging problem. While numerous methods have been developed, the value of using such an ontology is still not clear. We show in this paper that an ontology can be used to greatly reduce the number of features needed to do document clustering. Our hypothesis is that
polysemous
and
synonymous
nouns are both relatively prevalent and fundamentally important for document cluster formation. We show that nouns can be efficiently identified in documents and that this alone provides improved clustering. We next show the importance of the polysemous and synonymous nouns in clustering and develop a unique approach that allows us to measure the information gain in disambiguating these nouns in an unsupervised learning setting. In so doing, we can identify a
core
subset of semantic features that represent a text corpus. Empirical results show that by using core semantic features for clustering, one can reduce the number of features by 90% or more and still produce clusters that capture the main themes in a text corpus.</description><identifier>ISSN: 0219-1377</identifier><identifier>EISSN: 0219-3116</identifier><identifier>DOI: 10.1007/s10115-010-0370-4</identifier><identifier>CODEN: KISNCR</identifier><language>eng</language><publisher>London: Springer-Verlag</publisher><subject>Algorithms ; Cluster analysis ; Clustering ; Clusters ; Computer Science ; Computer viruses ; Data Mining and Knowledge Discovery ; Database Management ; Datasets ; Document management ; Documents ; Empirical analysis ; Gain ; Information Storage and Retrieval ; Information systems ; Information Systems and Communication Service ; Information Systems Applications (incl.Internet) ; IT in Business ; Learning ; Malware ; Medicine ; Ontology ; Regular Paper ; Semantics ; Studies ; Texts</subject><ispartof>Knowledge and information systems, 2011-08, Vol.28 (2), p.395-421</ispartof><rights>Springer-Verlag London Limited 2011</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c347t-dd250efca33acd2a7e1b9ef5f8f16bc494a2ded33506ddae583bbd884b1110283</citedby><cites>FETCH-LOGICAL-c347t-dd250efca33acd2a7e1b9ef5f8f16bc494a2ded33506ddae583bbd884b1110283</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://link.springer.com/content/pdf/10.1007/s10115-010-0370-4$$EPDF$$P50$$Gspringer$$H</linktopdf><linktohtml>$$Uhttps://link.springer.com/10.1007/s10115-010-0370-4$$EHTML$$P50$$Gspringer$$H</linktohtml><link.rule.ids>314,780,784,27923,27924,41487,42556,51318</link.rule.ids></links><search><creatorcontrib>Fodeh, Samah</creatorcontrib><creatorcontrib>Punch, 
Bill</creatorcontrib><creatorcontrib>Tan, Pang-Ning</creatorcontrib><title>On ontology-driven document clustering using core semantic features</title><title>Knowledge and information systems</title><addtitle>Knowl Inf Syst</addtitle><description>Incorporating semantic knowledge from an ontology into document clustering is an important but challenging problem. While numerous methods have been developed, the value of using such an ontology is still not clear. We show in this paper that an ontology can be used to greatly reduce the number of features needed to do document clustering. Our hypothesis is that
polysemous
and
synonymous
nouns are both relatively prevalent and fundamentally important for document cluster formation. We show that nouns can be efficiently identified in documents and that this alone provides improved clustering. We next show the importance of the polysemous and synonymous nouns in clustering and develop a unique approach that allows us to measure the information gain in disambiguating these nouns in an unsupervised learning setting. In so doing, we can identify a
core
subset of semantic features that represent a text corpus. Empirical results show that by using core semantic features for clustering, one can reduce the number of features by 90% or more and still produce clusters that capture the main themes in a text corpus.</description><subject>Algorithms</subject><subject>Cluster analysis</subject><subject>Clustering</subject><subject>Clusters</subject><subject>Computer Science</subject><subject>Computer viruses</subject><subject>Data Mining and Knowledge Discovery</subject><subject>Database Management</subject><subject>Datasets</subject><subject>Document management</subject><subject>Documents</subject><subject>Empirical analysis</subject><subject>Gain</subject><subject>Information Storage and Retrieval</subject><subject>Information systems</subject><subject>Information Systems and Communication Service</subject><subject>Information Systems Applications (incl.Internet)</subject><subject>IT in Business</subject><subject>Learning</subject><subject>Malware</subject><subject>Medicine</subject><subject>Ontology</subject><subject>Regular 
Paper</subject><subject>Semantics</subject><subject>Studies</subject><subject>Texts</subject><issn>0219-1377</issn><issn>0219-3116</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2011</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><recordid>eNp1kE1LxDAQhoMouK7-AG_Fi6foTNKP9CiLX7CwFz2HNJkuXdpmTVph_71duiAIXmbm8Lwvw8PYLcIDAhSPEQEx44DAQRbA0zO2AIEll4j5-elGWRSX7CrGHQAWOeKCrTZ94vvBt3574C4039Qnztuxo35IbDvGgULTb5MxHqf1gZJInemHxiY1mWEMFK_ZRW3aSDenvWSfL88fqze-3ry-r57W3Mq0GLhzIgOqrZHSWCdMQViVVGe1qjGvbFqmRjhyUmaQO2coU7KqnFJphYgglFyy-7l3H_zXSHHQXRMtta3pyY9Rl5jmAnIoJ_LuD7nzY-in57RSmJdKiGyCcIZs8DEGqvU-NJ0JB42gj1L1LFVPUvVRqk6njJgzcX_UQuG3-P_QD8_Xelg</recordid><startdate>20110801</startdate><enddate>20110801</enddate><creator>Fodeh, Samah</creator><creator>Punch, Bill</creator><creator>Tan, Pang-Ning</creator><general>Springer-Verlag</general><general>Springer Nature 
B.V</general><scope>AAYXX</scope><scope>CITATION</scope><scope>0U~</scope><scope>1-H</scope><scope>3V.</scope><scope>7SC</scope><scope>7WY</scope><scope>7WZ</scope><scope>7XB</scope><scope>87Z</scope><scope>8AL</scope><scope>8AO</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>8FK</scope><scope>8FL</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BEZIV</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FRNLG</scope><scope>F~G</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K60</scope><scope>K6~</scope><scope>K7-</scope><scope>L.-</scope><scope>L.0</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>M0C</scope><scope>M0N</scope><scope>P5Z</scope><scope>P62</scope><scope>PQBIZ</scope><scope>PQBZA</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>Q9U</scope></search><sort><creationdate>20110801</creationdate><title>On ontology-driven document clustering using core semantic features</title><author>Fodeh, Samah ; Punch, Bill ; Tan, Pang-Ning</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c347t-dd250efca33acd2a7e1b9ef5f8f16bc494a2ded33506ddae583bbd884b1110283</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2011</creationdate><topic>Algorithms</topic><topic>Cluster analysis</topic><topic>Clustering</topic><topic>Clusters</topic><topic>Computer Science</topic><topic>Computer viruses</topic><topic>Data Mining and Knowledge Discovery</topic><topic>Database Management</topic><topic>Datasets</topic><topic>Document management</topic><topic>Documents</topic><topic>Empirical analysis</topic><topic>Gain</topic><topic>Information Storage and Retrieval</topic><topic>Information systems</topic><topic>Information Systems and Communication Service</topic><topic>Information Systems Applications (incl.Internet)</topic><topic>IT 
in Business</topic><topic>Learning</topic><topic>Malware</topic><topic>Medicine</topic><topic>Ontology</topic><topic>Regular Paper</topic><topic>Semantics</topic><topic>Studies</topic><topic>Texts</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Fodeh, Samah</creatorcontrib><creatorcontrib>Punch, Bill</creatorcontrib><creatorcontrib>Tan, Pang-Ning</creatorcontrib><collection>CrossRef</collection><collection>Global News & ABI/Inform Professional</collection><collection>Trade PRO</collection><collection>ProQuest Central (Corporate)</collection><collection>Computer and Information Systems Abstracts</collection><collection>ABI/INFORM Collection</collection><collection>ABI/INFORM Global (PDF only)</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>ABI/INFORM Global (Alumni Edition)</collection><collection>Computing Database (Alumni Edition)</collection><collection>ProQuest Pharma Collection</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>ABI/INFORM Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies & Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Business Premium Collection</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>Business Premium Collection (Alumni)</collection><collection>ABI/INFORM Global (Corporate)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium 
Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>ProQuest Business Collection (Alumni Edition)</collection><collection>ProQuest Business Collection</collection><collection>Computer Science Database</collection><collection>ABI/INFORM Professional Advanced</collection><collection>ABI/INFORM Professional Standard</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>ABI/INFORM Global</collection><collection>Computing Database</collection><collection>Advanced Technologies & Aerospace Database</collection><collection>ProQuest Advanced Technologies & Aerospace Collection</collection><collection>ProQuest One Business</collection><collection>ProQuest One Business (Alumni)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central Basic</collection><jtitle>Knowledge and information systems</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Fodeh, Samah</au><au>Punch, Bill</au><au>Tan, Pang-Ning</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>On ontology-driven document clustering using core semantic features</atitle><jtitle>Knowledge and information systems</jtitle><stitle>Knowl Inf Syst</stitle><date>2011-08-01</date><risdate>2011</risdate><volume>28</volume><issue>2</issue><spage>395</spage><epage>421</epage><pages>395-421</pages><issn>0219-1377</issn><eissn>0219-3116</eissn><coden>KISNCR</coden><abstract>Incorporating semantic knowledge from an ontology into document clustering is an important but challenging problem. 
While numerous methods have been developed, the value of using such an ontology is still not clear. We show in this paper that an ontology can be used to greatly reduce the number of features needed to do document clustering. Our hypothesis is that
polysemous
and
synonymous
nouns are both relatively prevalent and fundamentally important for document cluster formation. We show that nouns can be efficiently identified in documents and that this alone provides improved clustering. We next show the importance of the polysemous and synonymous nouns in clustering and develop a unique approach that allows us to measure the information gain in disambiguating these nouns in an unsupervised learning setting. In so doing, we can identify a
core
subset of semantic features that represent a text corpus. Empirical results show that by using core semantic features for clustering, one can reduce the number of features by 90% or more and still produce clusters that capture the main themes in a text corpus.</abstract><cop>London</cop><pub>Springer-Verlag</pub><doi>10.1007/s10115-010-0370-4</doi><tpages>27</tpages></addata></record> |
fulltext | fulltext |
identifier | ISSN: 0219-1377 |
ispartof | Knowledge and information systems, 2011-08, Vol.28 (2), p.395-421 |
issn | 0219-1377 (print) ; 0219-3116 (electronic) |
language | eng |
recordid | cdi_proquest_miscellaneous_914620609 |
source | SpringerLink Journals - AutoHoldings |
subjects | Algorithms ; Cluster analysis ; Clustering ; Clusters ; Computer Science ; Computer viruses ; Data Mining and Knowledge Discovery ; Database Management ; Datasets ; Document management ; Documents ; Empirical analysis ; Gain ; Information Storage and Retrieval ; Information systems ; Information Systems and Communication Service ; Information Systems Applications (incl.Internet) ; IT in Business ; Learning ; Malware ; Medicine ; Ontology ; Regular Paper ; Semantics ; Studies ; Texts |
title | On ontology-driven document clustering using core semantic features |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-08T10%3A26%3A53IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=On%20ontology-driven%20document%20clustering%20using%20core%20semantic%20features&rft.jtitle=Knowledge%20and%20information%20systems&rft.au=Fodeh,%20Samah&rft.date=2011-08-01&rft.volume=28&rft.issue=2&rft.spage=395&rft.epage=421&rft.pages=395-421&rft.issn=0219-1377&rft.eissn=0219-3116&rft.coden=KISNCR&rft_id=info:doi/10.1007/s10115-010-0370-4&rft_dat=%3Cproquest_cross%3E2420205101%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=881698225&rft_id=info:pmid/&rfr_iscdi=true |
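
The abstract in this record describes keeping only polysemous and synonymous nouns as candidate features for clustering. The following is a minimal illustrative sketch of that filtering idea, not the authors' method: the toy ontology `TOY_ONTOLOGY`, the sense identifiers, the helper names, and the two sample documents are all hypothetical, and the information-gain step mentioned in the abstract is not modelled here.

```python
# Hypothetical toy ontology: noun -> set of sense identifiers.
# This stands in for a real lexical ontology and is an assumption of the sketch.
TOY_ONTOLOGY = {
    "bank": {"bank.finance", "bank.river"},
    "car": {"vehicle.auto"},
    "automobile": {"vehicle.auto"},
    "river": {"river.water"},
    "loan": {"loan.finance"},
}


def nouns_in(doc):
    """Return the tokens of doc that the toy ontology lists as nouns."""
    return [tok for tok in doc.lower().split() if tok in TOY_ONTOLOGY]


def is_polysemous(noun):
    """A noun is treated as polysemous if it has more than one sense."""
    return len(TOY_ONTOLOGY[noun]) > 1


def is_synonymous(noun, vocabulary):
    """A noun is treated as synonymous if it shares a sense with another noun."""
    return any(
        other != noun and TOY_ONTOLOGY[noun] & TOY_ONTOLOGY[other]
        for other in vocabulary
    )


def candidate_core_features(docs):
    """Keep only nouns that are polysemous or synonymous within the corpus."""
    vocabulary = {noun for doc in docs for noun in nouns_in(doc)}
    return sorted(
        noun
        for noun in vocabulary
        if is_polysemous(noun) or is_synonymous(noun, vocabulary)
    )


docs = [
    "the bank approved the loan for the car",
    "an automobile was parked by the river bank",
]
all_tokens = {tok for doc in docs for tok in doc.lower().split()}
core = candidate_core_features(docs)
print(core)  # ['automobile', 'bank', 'car']
print(len(core), "candidate features out of", len(all_tokens), "distinct tokens")
```

On this toy input the filter keeps 3 of 12 distinct tokens, which only hints at the kind of feature reduction the abstract reports; the figures in the paper come from its own experiments and are not reproduced by this sketch.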