Text Classification Using Novel Term Weighting Scheme-Based Improved TF-IDF for Internet Media Reports

With the rapid development of the internet technology, a large amount of internet text data can be obtained. The text classification (TC) technology plays a very important role in processing massive text data, but the accuracy of classification is directly affected by the performance of term weighti...

Detailed description

Bibliographic details
Published in: Mathematical problems in engineering 2021-03, Vol.2021, p.1-30
Main authors: Jiang, Zhiying; Gao, Bo; He, Yanlin; Han, Yongming; Doyle, Paul; Zhu, Qunxiong
Format: Article
Language: eng
Online access: Full text
container_end_page 30
container_issue
container_start_page 1
container_title Mathematical problems in engineering
container_volume 2021
creator Jiang, Zhiying
Gao, Bo
He, Yanlin
Han, Yongming
Doyle, Paul
Zhu, Qunxiong
description With the rapid development of internet technology, large amounts of internet text data have become available. Text classification (TC) plays an important role in processing massive text data, but classification accuracy depends directly on the performance of the term weighting scheme. Because term frequency-inverse document frequency (TF-IDF) was originally designed for information retrieval (IR), it is not effective enough for TC, especially for text data with unbalanced distributions, as in internet media reports. Therefore, the variance between the DF value of a particular term and the average of all DFs ($\overline{\mathrm{DF}}$), namely the document frequency variance (ADF), is proposed to improve the handling of text data with unbalanced distributions. The standard TF-IDF is then modified by the proposed ADF in four different ways, namely TF-IADF, TF-IADF+, TF-IADFnorm, and TF-IADF+norm, for processing unbalanced text collections. As a result, an effective model can be established for the TC task of internet media reports. A series of simulations was carried out to evaluate the proposed methods. Compared with TF-IDF across state-of-the-art classification algorithms, the simulation results confirm the effectiveness and feasibility of the proposed methods.
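The quantities named in the description can be illustrated with a minimal sketch. It computes classic TF-IDF and, separately, the squared deviation of each term's DF from the mean DF over all terms. This is only one plausible reading of "the variance between the DF value of a particular term and the average of all DFs"; the record does not give the paper's exact ADF formula or say how the four TF-IADF variants combine it with TF, so none of that is reproduced here. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # set() so a term counts once per document
    return df


def tf_idf(docs):
    """Classic TF-IDF: raw term count times log(N / DF)."""
    n_docs = len(docs)
    df = document_frequencies(docs)
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


def df_squared_deviation(docs):
    """Squared deviation of each term's DF from the mean DF of all terms.

    Assumption: an illustrative reading of the abstract's wording,
    not the paper's definition of ADF.
    """
    df = document_frequencies(docs)
    mean_df = sum(df.values()) / len(df)
    return {t: (d - mean_df) ** 2 for t, d in df.items()}


# Hypothetical toy corpus: "market" occurs in every document.
docs = [
    ["market", "stocks", "rise"],
    ["market", "team", "wins"],
    ["market", "stocks", "fall"],
]

weights = tf_idf(docs)
deviation = df_squared_deviation(docs)
```

In this toy corpus "market" appears in all three documents, so its TF-IDF weight is zero, while its DF lies furthest from the mean DF. A DF-deviation term of this kind carries information that plain IDF discards, which is the general motivation the description gives for modifying TF-IDF.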
doi_str_mv 10.1155/2021/6619088
format Article
contributor Zeng, Nianyin
date 2021-03-05
publisher New York: Hindawi
orcid 0000-0002-2719-0516
0000-0003-3877-7432
0000-0002-4643-7043
0000-0002-0037-6871
0000-0003-3209-725X
0000-0001-8840-7056
peer_reviewed true
open_access free_for_read
rights Copyright © 2021 Zhiying Jiang et al. This is an open access article distributed under the Creative Commons Attribution License (the "License"), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. https://creativecommons.org/licenses/by/4.0
fulltext fulltext
identifier ISSN: 1024-123X
ispartof Mathematical problems in engineering, 2021-03, Vol.2021, p.1-30
issn 1024-123X
1563-5147
language eng
recordid cdi_proquest_journals_2501177244
source Wiley Online Library Open Access; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; Alma/SFX Local Collection
subjects Algorithms
Classification
Deep learning
Information retrieval
Internet
Methods
Neural networks
Performance evaluation
Semantic analysis
Semantics
Text categorization
Weighting
title Text Classification Using Novel Term Weighting Scheme-Based Improved TF-IDF for Internet Media Reports
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-21T20%3A31%3A28IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Text%20Classification%20Using%20Novel%20Term%20Weighting%20Scheme-Based%20Improved%20TF-IDF%20for%20Internet%20Media%20Reports&rft.jtitle=Mathematical%20problems%20in%20engineering&rft.au=Jiang,%20Zhiying&rft.date=2021-03-05&rft.volume=2021&rft.spage=1&rft.epage=30&rft.pages=1-30&rft.issn=1024-123X&rft.eissn=1563-5147&rft_id=info:doi/10.1155/2021/6619088&rft_dat=%3Cproquest_cross%3E2501177244%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2501177244&rft_id=info:pmid/&rfr_iscdi=true