Relative discrimination criterion – A novel feature ranking method for text data

• Discussed characteristics of text data. • Indicated that term counts are ignored when calculating term rank. • Proposed a new feature ranking algorithm (RDC) that considers term counts. • Compared the performance of RDC with four feature ranking metrics on four datasets. • RDC shows the highest performance in 65%...

Detailed description

Bibliographic details
Published in: Expert systems with applications 2015-05, Vol.42 (7), p.3670-3681
Main authors: Rehman, Abdur, Javed, Kashif, Babri, Haroon A., Saeed, Mehreen
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 3681
container_issue 7
container_start_page 3670
container_title Expert systems with applications
container_volume 42
creator Rehman, Abdur
Javed, Kashif
Babri, Haroon A.
Saeed, Mehreen
description • Discussed characteristics of text data. • Indicated that term counts are ignored when calculating term rank. • Proposed a new feature ranking algorithm (RDC) that considers term counts. • Compared the performance of RDC with four feature ranking metrics on four datasets. • RDC shows the highest performance in 65% of the classification cases. The high dimensionality of text data hinders the performance of classifiers, making it necessary to apply feature selection for dimensionality reduction. Most feature ranking metrics for text classification are based on the document frequencies (df) of a term in the positive and negative classes. Ranking features by document frequencies alone favors terms that occur frequently in the larger classes of unbalanced datasets. In this paper, we introduce a new feature ranking metric termed relative discrimination criterion (RDC), which takes the document frequencies for each term count of a term into account when estimating the usefulness of that term. The performance of RDC is compared with four well-known feature ranking metrics, information gain (IG), CHI squared (CHI), odds ratio (OR) and distinguishing feature selector (DFS), using support vector machine (SVM) and multinomial naive Bayes (MNB) classifiers on four benchmark datasets, namely Reuters, 20 Newsgroups and two subsets of the Ohsumed dataset. Our results based on macro and micro F1 measures show that the performance of RDC is superior to that of the other four metrics in 65% of our experimental trials. Also, RDC attains the highest macro and micro F1 values in 69% of the cases.
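The distinction the abstract draws between plain document frequencies and document frequencies per term count can be illustrated with a small sketch. This is not the RDC formula, which this record does not give; it only shows the two kinds of counts the abstract contrasts. The function names `document_frequency` and `df_by_term_count`, the toy documents, and the labels `pos` and `neg` are all hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict

def document_frequency(docs, labels, term):
    """Per-class document frequency: documents containing `term` at least once."""
    df = Counter()
    for tokens, label in zip(docs, labels):
        if term in tokens:
            df[label] += 1
    return dict(df)

def df_by_term_count(docs, labels, term):
    """Per-class document frequency broken down by how often `term` occurs.

    Returns {label: {term_count: number_of_documents}}.
    """
    table = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        count = tokens.count(term)
        if count > 0:
            table[label][count] += 1
    return {label: dict(counts) for label, counts in table.items()}

# Toy corpus (hypothetical): tokenized documents with binary class labels.
docs = [
    ["oil", "price", "oil", "oil"],
    ["oil", "market"],
    ["oil", "team"],
    ["game", "team"],
]
labels = ["pos", "pos", "neg", "neg"]

print(document_frequency(docs, labels, "oil"))
print(df_by_term_count(docs, labels, "oil"))
```

In this toy corpus, plain document frequency only records that "oil" appears in two positive documents and one negative document. The per-term-count breakdown additionally records that one positive document uses the term three times, which is the kind of information the abstract says document-frequency-based metrics discard.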
doi_str_mv 10.1016/j.eswa.2014.12.013
format Article
publisher Elsevier Ltd
rights 2014 Elsevier Ltd
date 2015-05-01
tpages 12
toplevel peer_reviewed
online_resources
fulltext fulltext
identifier ISSN: 0957-4174
ispartof Expert systems with applications, 2015-05, Vol.42 (7), p.3670-3681
issn 0957-4174
1873-6793
language eng
recordid cdi_proquest_miscellaneous_1825461486
source Elsevier ScienceDirect Journals Complete
subjects Classifiers
Criteria
Discrimination
Document frequency
Expert systems
False positive rate
Feature selection
Ranking
Selectors
Support vector machines
Term count
Text classification
Texts
True positive rate
title Relative discrimination criterion – A novel feature ranking method for text data
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-28T17%3A51%3A52IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Relative%20discrimination%20criterion%20%E2%80%93%20A%20novel%20feature%20ranking%20method%20for%20text%20data&rft.jtitle=Expert%20systems%20with%20applications&rft.au=Rehman,%20Abdur&rft.date=2015-05-01&rft.volume=42&rft.issue=7&rft.spage=3670&rft.epage=3681&rft.pages=3670-3681&rft.issn=0957-4174&rft.eissn=1873-6793&rft_id=info:doi/10.1016/j.eswa.2014.12.013&rft_dat=%3Cproquest_cross%3E1825461486%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1825461486&rft_id=info:pmid/&rft_els_id=S095741741400791X&rfr_iscdi=true