A Communication Perspective on Automatic Text Categorization
The basic concern of a Communication System is to transfer information from its source to a destination some distance away. Textual documents also deal with the transmission of information. Particularly, from a text categorization system point of view, the information encoded by a document is the to...
Published in: | IEEE transactions on knowledge and data engineering 2009-07, Vol.21 (7), p.1027-1041 |
---|---|
Main authors: | Capdevila, M., Florez, O.W.M. |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
container_end_page | 1041 |
---|---|
container_issue | 7 |
container_start_page | 1027 |
container_title | IEEE transactions on knowledge and data engineering |
container_volume | 21 |
creator | Capdevila, M. ; Florez, O.W.M. |
description | The basic concern of a communication system is to transfer information from its source to a destination some distance away. Textual documents also transmit information: from the point of view of a text categorization system, the information encoded by a document is the topic or category it belongs to. Following this intuition, a theoretical framework is developed in which Automatic Text Categorization (ATC) is studied from a communication system perspective. Under this approach, the problem of reducing the dimensionality of the indexing feature space is tackled by a two-level supervised scheme: a filter for noisy terms followed by a compression of redundant terms. Gaussian probabilistic categorizers are revisited and adapted to the sparsity that accompanies ATC. Experimental results on the 20 Newsgroups and Reuters-21578 collections validate the theoretical approach. The noise filter and redundancy compressor allow an aggressive reduction of the term vocabulary (reduction factor greater than 0.99) with a minimal loss (lower than 3 percent) and, in some cases, a gain (greater than 4 percent) in final classification accuracy. The adapted Gaussian Naive Bayes classifier reaches classification results similar to those obtained by state-of-the-art Multinomial Naive Bayes (MNB) and Support Vector Machines (SVMs). |
doi_str_mv | 10.1109/TKDE.2009.22 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 1041-4347 |
ispartof | IEEE transactions on knowledge and data engineering, 2009-07, Vol.21 (7), p.1027-1041 |
issn | 1041-4347 ; 1558-2191 |
language | eng |
recordid | cdi_pascalfrancis_primary_21806130 |
source | IEEE Electronic Library (IEL) |
subjects | Applied sciences ; Artificial intelligence ; Bayesian analysis ; Categories ; Classification ; classifier design and evaluation ; clustering ; Communication systems ; Communications systems ; Computer science; control theory; systems ; Data communications ; data compaction and compression ; Data processing. List processing. Character string processing ; Decoding ; Exact sciences and technology ; feature evaluation and selection ; Filtering ; Filters ; Gaussian ; Indexing ; Memory organisation. Data processing ; Noise reduction ; Reduction ; Redundancy ; Software ; Speech and sound recognition and synthesis. Linguistics ; Studies ; Support vector machines ; Text categorization ; text processing ; Texts ; Vocabulary |
title | A Communication Perspective on Automatic Text Categorization |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-13T07%3A40%3A08IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20Communication%20Perspective%20on%20Automatic%20Text%20Categorization&rft.jtitle=IEEE%20transactions%20on%20knowledge%20and%20data%20engineering&rft.au=Capdevila,%20M.&rft.date=2009-07-01&rft.volume=21&rft.issue=7&rft.spage=1027&rft.epage=1041&rft.pages=1027-1041&rft.issn=1041-4347&rft.eissn=1558-2191&rft.coden=ITKEEH&rft_id=info:doi/10.1109/TKDE.2009.22&rft_dat=%3Cproquest_RIE%3E2543469371%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=912031845&rft_id=info:pmid/&rft_ieee_id=4752825&rfr_iscdi=true |
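
The abstract describes a two-level vocabulary reduction (a noise filter followed by a redundancy compressor) and reports a "reduction factor". The sketch below only illustrates that general shape on a toy corpus. The record does not state the paper's actual criteria, so the purity threshold, the merge-by-dominant-category rule, the toy documents, and every function name here are assumptions made for illustration, not the authors' method.

```python
from collections import defaultdict

# Toy corpus of (tokens, category) pairs. Purely illustrative data.
DOCS = [
    (["goal", "match", "team", "the"], "sport"),
    (["team", "score", "goal", "the"], "sport"),
    (["vote", "party", "election", "the"], "politics"),
    (["party", "vote", "law", "the"], "politics"),
]


def term_class_counts(docs):
    """For every term, count how many documents of each category contain it."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens, label in docs:
        for term in set(tokens):
            counts[term][label] += 1
    return counts


def noise_filter(counts, min_purity):
    """Stage 1 (assumed criterion): keep terms concentrated in one category.

    Returns a mapping from each kept term to its dominant category.
    """
    kept = {}
    for term, by_class in counts.items():
        total = sum(by_class.values())
        purity = max(by_class.values()) / total
        if purity >= min_purity:
            kept[term] = max(by_class, key=by_class.get)
    return kept


def compress(kept):
    """Stage 2 (assumed criterion): merge terms that share a dominant category."""
    groups = defaultdict(list)
    for term, label in kept.items():
        groups[label].append(term)
    return {label: sorted(terms) for label, terms in groups.items()}


def reduction_factor(original_size, reduced_size):
    """Fraction of the original feature count that was removed."""
    return 1 - reduced_size / original_size


counts = term_class_counts(DOCS)
kept = noise_filter(counts, min_purity=0.9)
features = compress(kept)
factor = reduction_factor(len(counts), len(features))
```

Under this definition, a reduction factor greater than 0.99 means that fewer than 1 in 100 of the original features remain; a vocabulary of 100,000 terms, for example, would shrink to fewer than 1,000 features.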