Fuzzy Bag-of-Words Model for Document Representation


Bibliographic Details
Published in: IEEE transactions on fuzzy systems 2018-04, Vol.26 (2), p.794-804
Main Authors: Zhao, Rui; Mao, Kezhi
Format: Article
Language: English
Subjects:
container_end_page 804
container_issue 2
container_start_page 794
container_title IEEE transactions on fuzzy systems
container_volume 26
creator Zhao, Rui
Mao, Kezhi
description One key issue in text mining and natural language processing is how to effectively represent documents using numerical vectors. One classical model is the Bag-of-Words (BoW). In a BoW-based vector representation of a document, each element denotes the normalized number of occurrences of a basis term in the document. To count the occurrences of a basis term, BoW performs exact word matching, which can be regarded as a hard mapping from words to the basis terms. The BoW representation suffers from intrinsic extreme sparsity, high dimensionality, and an inability to capture the high-level semantic meaning behind text data. To address these issues, we propose a new document representation method named fuzzy Bag-of-Words (FBoW) in this paper. FBoW adopts a fuzzy mapping based on the semantic correlation among words, quantified by cosine similarity between word embeddings. Since semantic word matching is used instead of exact word string matching, FBoW can encode more semantics into the numerical representation. In addition, we propose to use word clusters instead of individual words as basis terms and develop fuzzy Bag-of-WordClusters (FBoWC) models. Three variants under the framework of FBoWC are proposed, based on three different similarity measures between word clusters and words, named \text{FBoWC}_{\rm mean}, \text{FBoWC}_{\rm max}, and \text{FBoWC}_{\rm min}, respectively. Document representations learned by the proposed FBoW and FBoWC are dense and able to encode high-level semantics. The task of document categorization is used to evaluate the performance of the representations learned by the proposed FBoW and FBoWC methods. The results on seven real-world document classification datasets, in comparison with six document representation learning methods, show that our methods FBoW and FBoWC achieve the highest classification accuracies.
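The fuzzy mapping described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the zero threshold on negative similarities, and the L1 normalization are assumptions made here for illustration. Each word in a document contributes its cosine similarity to every basis term (rather than a 0/1 exact-match count), and the word-to-cluster similarity used by the three FBoWC variants reduces the pairwise similarities with mean, max, or min.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def fbow_vector(doc_tokens, basis_terms, embeddings):
    """Sketch of a fuzzy Bag-of-Words document vector: each word in the
    document adds its cosine similarity to every basis term, replacing the
    hard exact-match counting of classical BoW."""
    vec = np.zeros(len(basis_terms))
    for word in doc_tokens:
        if word not in embeddings:
            continue  # out-of-vocabulary words are skipped in this sketch
        for j, term in enumerate(basis_terms):
            sim = cosine(embeddings[word], embeddings[term])
            if sim > 0.0:  # assumed: only positively correlated words count
                vec[j] += sim
    total = vec.sum()
    return vec / total if total > 0 else vec  # normalize, as in classical BoW

def cluster_similarity(word_vec, cluster_vecs, mode="mean"):
    """Word-to-cluster similarity for the three FBoWC variants:
    mean, max, or min of the pairwise cosine similarities."""
    sims = [cosine(word_vec, c) for c in cluster_vecs]
    return {"mean": np.mean, "max": np.max, "min": np.min}[mode](sims)
```

A document sharing a basis term exactly contributes similarity 1 to that term, recovering the BoW count as a special case; semantically related words contribute fractional weight, which is what makes the resulting vectors dense.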
doi_str_mv 10.1109/TFUZZ.2017.2690222
format Article
fulltext fulltext_linktorsrc
identifier ISSN: 1063-6706
ispartof IEEE transactions on fuzzy systems, 2018-04, Vol.26 (2), p.794-804
issn 1063-6706
1941-0034
language eng
recordid cdi_ieee_primary_7891009
source IEEE Electronic Library (IEL)
subjects Analytical models
Computational modeling
Document classification
document representation
Dogs
fuzzy similarity
Numerical models
Semantics
Text mining
Vocabulary
word embeddings
title Fuzzy Bag-of-Words Model for Document Representation