A Communication Perspective on Automatic Text Categorization

The basic concern of a Communication System is to transfer information from its source to a destination some distance away. Textual documents also deal with the transmission of information. Particularly, from a text categorization system point of view, the information encoded by a document is the to...

Detailed description

Bibliographic details
Published in: IEEE transactions on knowledge and data engineering, 2009-07, Vol.21 (7), p.1027-1041
Main authors: Capdevila, M., Florez, O.W.M.
Format: Article
Language: eng
container_end_page 1041
container_issue 7
container_start_page 1027
container_title IEEE transactions on knowledge and data engineering
container_volume 21
creator Capdevila, M.
Florez, O.W.M.
description The basic concern of a Communication System is to transfer information from its source to a destination some distance away. Textual documents also deal with the transmission of information. Particularly, from a text categorization system point of view, the information encoded by a document is the topic or category it belongs to. Following this initial intuition, a theoretical framework is developed where Automatic Text Categorization (ATC) is studied under a Communication System perspective. Under this approach, the problematic indexing feature space dimensionality reduction has been tackled by a two-level supervised scheme, implemented by a noisy terms filtering and a subsequent redundant terms compression. Gaussian probabilistic categorizers have been revisited and adapted to the concomitance of sparsity in ATC. Experimental results pertaining to 20 Newsgroups and Reuters-21578 collections validate the theoretical approaches. The noise filter and redundancy compressor allow an aggressive term vocabulary reduction (reduction factor greater than 0.99) with a minimum loss (lower than 3 percent) and, in some cases, gain (greater than 4 percent) of final classification accuracy. The adapted Gaussian Naive Bayes classifier reaches classification results similar to those obtained by state-of-the-art Multinomial Naive Bayes (MNB) and Support Vector Machines (SVMs).
doi_str_mv 10.1109/TKDE.2009.22
format Article
publisher New York, NY: IEEE
coden ITKEEH
rights 2009 INIST-CNRS; Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2009
tpages 15
fulltext fulltext_linktorsrc
identifier ISSN: 1041-4347
ispartof IEEE transactions on knowledge and data engineering, 2009-07, Vol.21 (7), p.1027-1041
issn 1041-4347
1558-2191
language eng
recordid cdi_pascalfrancis_primary_21806130
source IEEE Electronic Library (IEL)
subjects Applied sciences
Artificial intelligence
Bayesian analysis
Categories
Classification
classifier design and evaluation
clustering
Communication systems
Communications systems
Computer science; control theory; systems
Data communications
data compaction and compression
Data processing. List processing. Character string processing
Decoding
Exact sciences and technology
feature evaluation and selection
Filtering
Filters
Gaussian
Indexing
Memory organisation. Data processing
Noise reduction
Reduction
Redundancy
Software
Speech and sound recognition and synthesis. Linguistics
Studies
Support vector machines
Text categorization
text processing
Texts
Vocabulary
title A Communication Perspective on Automatic Text Categorization
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-13T07%3A40%3A08IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20Communication%20Perspective%20on%20Automatic%20Text%20Categorization&rft.jtitle=IEEE%20transactions%20on%20knowledge%20and%20data%20engineering&rft.au=Capdevila,%20M.&rft.date=2009-07-01&rft.volume=21&rft.issue=7&rft.spage=1027&rft.epage=1041&rft.pages=1027-1041&rft.issn=1041-4347&rft.eissn=1558-2191&rft.coden=ITKEEH&rft_id=info:doi/10.1109/TKDE.2009.22&rft_dat=%3Cproquest_RIE%3E2543469371%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=912031845&rft_id=info:pmid/&rft_ieee_id=4752825&rfr_iscdi=true