Comparison of text feature selection policies and using an adaptive framework

► A comprehensive analysis of feature selection metrics is given. ► New feature selection metrics are introduced. ► Adaptive keyword selection method is proposed. ► Local and global feature selection performances are compared. Text categorization is the task of automatically assigning unlabeled text...

Bibliographic details
Published in: Expert systems with applications 2013-09, Vol.40 (12), p.4871-4886
Main authors: Tasci, S; Guengoer, T
Format: Article
Language: eng
Online access: Full text
container_end_page 4886
container_issue 12
container_start_page 4871
container_title Expert systems with applications
container_volume 40
creator Tasci, S
Guengoer, T
description ► A comprehensive analysis of feature selection metrics is given. ► New feature selection metrics are introduced. ► An adaptive keyword selection method is proposed. ► Local and global feature selection performances are compared. Text categorization is the task of automatically assigning unlabeled text documents to predefined category labels by means of an induction algorithm. Because the data in text categorization are high-dimensional, feature selection is often used to reduce the dimensionality. In this paper, we evaluate and compare the feature selection policies used in text categorization by employing some of the popular feature selection metrics. For the experiments, we use datasets that vary in size, complexity, and skewness. We use a support vector machine as the classifier and tf-idf weighting for the terms. In addition to evaluating the policies, we propose new feature selection metrics that show high success rates, especially with a low number of keywords. These metrics are two-sided local metrics based on the difference between the distributions of a term in the documents belonging to a class and in the documents not belonging to that class. Moreover, we propose a keyword selection framework called adaptive keyword selection. It selects a different number of terms for each class and shows significant improvement on skewed datasets that have a limited number of training instances for some of the classes.
doi_str_mv 10.1016/j.eswa.2013.02.019
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_1701124612</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S0957417413001358</els_id><sourcerecordid>1530986761</sourcerecordid><originalsourceid>FETCH-LOGICAL-c392t-f0715c78ce9a33f81103f0299c2cd216f3052e2b5d9558bfa924fd55cece19963</originalsourceid><addsrcrecordid>eNqNkcFO3DAQhi1UJLa0L9BTLki9JMzYcRxLXNCqFCQqLuVsGWdceZuNg52F8vb1ahHH0ottyZ9nxv_H2BeEBgG7801D-dk2HFA0wBtAfcRW2CtRd0qLD2wFWqq6RdWesI85bwBQAagV-7GO29mmkONURV8t9GepPNlll6jKNJJbQrmZ4xhcoFzZaah2OUy_yqmyg52X8ESVT3ZLzzH9_sSOvR0zfX7dT9n91bef6-v69u77zfrytnZC86X2oFA61TvSVgjfI4LwwLV23A0cOy9AcuIPctBS9g_eat76QUpHjlDrTpyyr4e6c4qPO8qL2YbsaBztRHGXTfkcIm875P-FQitKYu-jUoDuO9Xh-2jb9qosaj8AP6AuxZwTeTOnsLXpxSCYvTyzMXt5Zi_PADeHUc5e69vs7FgCnlzIby-5Eq3m2Bfu4sBRSfspUDK5eJocDSEVd2aI4V9t_gKVb645</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1448714472</pqid></control><display><type>article</type><title>Comparison of text feature selection policies and using an adaptive framework</title><source>Elsevier ScienceDirect Journals Complete</source><creator>Tasci, S ; Guengoer, T</creator><creatorcontrib>Tasci, S ; Guengoer, T</creatorcontrib><description>► A comprehensive analysis of feature selection metrics is given. ► New feature selection metrics are introduced. ► Adaptive keyword selection method is proposed. ► Local and global feature selection performances are compared. Text categorization is the task of automatically assigning unlabeled text documents to some predefined category labels by means of an induction algorithm. Since the data in text categorization are high-dimensional, often feature selection is used for reducing the dimensionality. In this paper, we make an evaluation and comparison of the feature selection policies used in text categorization by employing some of the popular feature selection metrics. 
For the experiments, we use datasets which vary in size, complexity, and skewness. We use support vector machine as the classifier and tf-idf weighting for weighting the terms. In addition to the evaluation of the policies, we propose new feature selection metrics which show high success rates especially with low number of keywords. These metrics are two-sided local metrics and are based on the difference of the distributions of a term in the documents belonging to a class and in the documents not belonging to that class. Moreover, we propose a keyword selection framework called adaptive keyword selection. It is based on selecting different number of terms for each class and it shows significant improvement on skewed datasets that have a limited number of training instances for some of the classes.</description><identifier>ISSN: 0957-4174</identifier><identifier>EISSN: 1873-6793</identifier><identifier>DOI: 10.1016/j.eswa.2013.02.019</identifier><language>eng</language><publisher>Amsterdam: Elsevier Ltd</publisher><subject>Adaptive keyword selection ; Algorithms ; Applied sciences ; Artificial intelligence ; Categories ; Computer science; control theory; systems ; Data processing. List processing. Character string processing ; Document categorization ; Exact sciences and technology ; Expert systems ; Feature selection ; Local and global policies ; Memory organisation. Data processing ; Policies ; Software ; Speech and sound recognition and synthesis. 
Linguistics ; Support vector machines ; Tasks ; Texts ; Weighting</subject><ispartof>Expert systems with applications, 2013-09, Vol.40 (12), p.4871-4886</ispartof><rights>2013 Elsevier Ltd</rights><rights>2014 INIST-CNRS</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c392t-f0715c78ce9a33f81103f0299c2cd216f3052e2b5d9558bfa924fd55cece19963</citedby><cites>FETCH-LOGICAL-c392t-f0715c78ce9a33f81103f0299c2cd216f3052e2b5d9558bfa924fd55cece19963</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://dx.doi.org/10.1016/j.eswa.2013.02.019$$EHTML$$P50$$Gelsevier$$H</linktohtml><link.rule.ids>314,780,784,3550,27924,27925,45995</link.rule.ids><backlink>$$Uhttp://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&amp;idt=27349218$$DView record in Pascal Francis$$Hfree_for_read</backlink></links><search><creatorcontrib>Tasci, S</creatorcontrib><creatorcontrib>Guengoer, T</creatorcontrib><title>Comparison of text feature selection policies and using an adaptive framework</title><title>Expert systems with applications</title><description>► A comprehensive analysis of feature selection metrics is given. ► New feature selection metrics are introduced. ► Adaptive keyword selection method is proposed. ► Local and global feature selection performances are compared. Text categorization is the task of automatically assigning unlabeled text documents to some predefined category labels by means of an induction algorithm. Since the data in text categorization are high-dimensional, often feature selection is used for reducing the dimensionality. In this paper, we make an evaluation and comparison of the feature selection policies used in text categorization by employing some of the popular feature selection metrics. 
For the experiments, we use datasets which vary in size, complexity, and skewness. We use support vector machine as the classifier and tf-idf weighting for weighting the terms. In addition to the evaluation of the policies, we propose new feature selection metrics which show high success rates especially with low number of keywords. These metrics are two-sided local metrics and are based on the difference of the distributions of a term in the documents belonging to a class and in the documents not belonging to that class. Moreover, we propose a keyword selection framework called adaptive keyword selection. It is based on selecting different number of terms for each class and it shows significant improvement on skewed datasets that have a limited number of training instances for some of the classes.</description><subject>Adaptive keyword selection</subject><subject>Algorithms</subject><subject>Applied sciences</subject><subject>Artificial intelligence</subject><subject>Categories</subject><subject>Computer science; control theory; systems</subject><subject>Data processing. List processing. Character string processing</subject><subject>Document categorization</subject><subject>Exact sciences and technology</subject><subject>Expert systems</subject><subject>Feature selection</subject><subject>Local and global policies</subject><subject>Memory organisation. Data processing</subject><subject>Policies</subject><subject>Software</subject><subject>Speech and sound recognition and synthesis. 
Linguistics</subject><subject>Support vector machines</subject><subject>Tasks</subject><subject>Texts</subject><subject>Weighting</subject><issn>0957-4174</issn><issn>1873-6793</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2013</creationdate><recordtype>article</recordtype><recordid>eNqNkcFO3DAQhi1UJLa0L9BTLki9JMzYcRxLXNCqFCQqLuVsGWdceZuNg52F8vb1ahHH0ottyZ9nxv_H2BeEBgG7801D-dk2HFA0wBtAfcRW2CtRd0qLD2wFWqq6RdWesI85bwBQAagV-7GO29mmkONURV8t9GepPNlll6jKNJJbQrmZ4xhcoFzZaah2OUy_yqmyg52X8ESVT3ZLzzH9_sSOvR0zfX7dT9n91bef6-v69u77zfrytnZC86X2oFA61TvSVgjfI4LwwLV23A0cOy9AcuIPctBS9g_eat76QUpHjlDrTpyyr4e6c4qPO8qL2YbsaBztRHGXTfkcIm875P-FQitKYu-jUoDuO9Xh-2jb9qosaj8AP6AuxZwTeTOnsLXpxSCYvTyzMXt5Zi_PADeHUc5e69vs7FgCnlzIby-5Eq3m2Bfu4sBRSfspUDK5eJocDSEVd2aI4V9t_gKVb645</recordid><startdate>20130915</startdate><enddate>20130915</enddate><creator>Tasci, S</creator><creator>Guengoer, T</creator><general>Elsevier Ltd</general><general>Elsevier</general><scope>IQODW</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>20130915</creationdate><title>Comparison of text feature selection policies and using an adaptive framework</title><author>Tasci, S ; Guengoer, T</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c392t-f0715c78ce9a33f81103f0299c2cd216f3052e2b5d9558bfa924fd55cece19963</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2013</creationdate><topic>Adaptive keyword selection</topic><topic>Algorithms</topic><topic>Applied sciences</topic><topic>Artificial intelligence</topic><topic>Categories</topic><topic>Computer science; control theory; systems</topic><topic>Data processing. List processing. 
Character string processing</topic><topic>Document categorization</topic><topic>Exact sciences and technology</topic><topic>Expert systems</topic><topic>Feature selection</topic><topic>Local and global policies</topic><topic>Memory organisation. Data processing</topic><topic>Policies</topic><topic>Software</topic><topic>Speech and sound recognition and synthesis. Linguistics</topic><topic>Support vector machines</topic><topic>Tasks</topic><topic>Texts</topic><topic>Weighting</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Tasci, S</creatorcontrib><creatorcontrib>Guengoer, T</creatorcontrib><collection>Pascal-Francis</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>Expert systems with applications</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Tasci, S</au><au>Guengoer, T</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Comparison of text feature selection policies and using an adaptive framework</atitle><jtitle>Expert systems with applications</jtitle><date>2013-09-15</date><risdate>2013</risdate><volume>40</volume><issue>12</issue><spage>4871</spage><epage>4886</epage><pages>4871-4886</pages><issn>0957-4174</issn><eissn>1873-6793</eissn><abstract>► A comprehensive analysis of feature selection metrics is given. ► New feature selection metrics are introduced. ► Adaptive keyword selection method is proposed. ► Local and global feature selection performances are compared. 
Text categorization is the task of automatically assigning unlabeled text documents to some predefined category labels by means of an induction algorithm. Since the data in text categorization are high-dimensional, often feature selection is used for reducing the dimensionality. In this paper, we make an evaluation and comparison of the feature selection policies used in text categorization by employing some of the popular feature selection metrics. For the experiments, we use datasets which vary in size, complexity, and skewness. We use support vector machine as the classifier and tf-idf weighting for weighting the terms. In addition to the evaluation of the policies, we propose new feature selection metrics which show high success rates especially with low number of keywords. These metrics are two-sided local metrics and are based on the difference of the distributions of a term in the documents belonging to a class and in the documents not belonging to that class. Moreover, we propose a keyword selection framework called adaptive keyword selection. It is based on selecting different number of terms for each class and it shows significant improvement on skewed datasets that have a limited number of training instances for some of the classes.</abstract><cop>Amsterdam</cop><pub>Elsevier Ltd</pub><doi>10.1016/j.eswa.2013.02.019</doi><tpages>16</tpages></addata></record>
fulltext fulltext
identifier ISSN: 0957-4174
ispartof Expert systems with applications, 2013-09, Vol.40 (12), p.4871-4886
issn 0957-4174
1873-6793
language eng
recordid cdi_proquest_miscellaneous_1701124612
source Elsevier ScienceDirect Journals Complete
subjects Adaptive keyword selection
Algorithms
Applied sciences
Artificial intelligence
Categories
Computer science; control theory; systems
Data processing. List processing. Character string processing
Document categorization
Exact sciences and technology
Expert systems
Feature selection
Local and global policies
Memory organisation. Data processing
Policies
Software
Speech and sound recognition and synthesis. Linguistics
Support vector machines
Tasks
Texts
Weighting
title Comparison of text feature selection policies and using an adaptive framework
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-03T08%3A12%3A14IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Comparison%20of%20text%20feature%20selection%20policies%20and%20using%20an%20adaptive%20framework&rft.jtitle=Expert%20systems%20with%20applications&rft.au=Tasci,%20S&rft.date=2013-09-15&rft.volume=40&rft.issue=12&rft.spage=4871&rft.epage=4886&rft.pages=4871-4886&rft.issn=0957-4174&rft.eissn=1873-6793&rft_id=info:doi/10.1016/j.eswa.2013.02.019&rft_dat=%3Cproquest_cross%3E1530986761%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1448714472&rft_id=info:pmid/&rft_els_id=S0957417413001358&rfr_iscdi=true
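
The description above mentions two-sided local metrics built on the difference between a term's distribution inside and outside a class, and a framework that selects a different number of terms per class. The record does not give the paper's formulas, so the following is only a minimal sketch under stated assumptions: the score |P(term | class) - P(term | not class)| over document frequencies is an illustrative stand-in, and the function names and the caller-supplied `per_class_k` mapping are hypothetical.

```python
from collections import Counter


def doc_freq_difference(docs, labels, target):
    """Score each term by |P(term | target) - P(term | not target)|.

    Assumption: an illustrative two-sided local score, not the paper's metric.
    """
    in_class = [set(d) for d, y in zip(docs, labels) if y == target]
    out_class = [set(d) for d, y in zip(docs, labels) if y != target]
    # Document frequency: number of documents containing the term.
    df_in = Counter(t for d in in_class for t in d)
    df_out = Counter(t for d in out_class for t in d)
    n_in = max(len(in_class), 1)
    n_out = max(len(out_class), 1)
    return {
        t: abs(df_in[t] / n_in - df_out[t] / n_out)
        for t in set(df_in) | set(df_out)
    }


def select_keywords(docs, labels, per_class_k):
    """Pick the top-k terms per class; k may differ from class to class."""
    selected = {}
    for target, k in per_class_k.items():
        scores = doc_freq_difference(docs, labels, target)
        # Descending score, then alphabetical order for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        selected[target] = ranked[:k]
    return selected


# Toy tokenized corpus (hypothetical data).
docs = [
    ["goal", "match", "team"],
    ["team", "coach", "goal"],
    ["stock", "market", "team"],
    ["market", "price", "stock"],
]
labels = ["sport", "sport", "finance", "finance"]
keywords = select_keywords(docs, labels, {"sport": 1, "finance": 2})
```

Because the score is two-sided, a term that is frequent outside a class ranks as highly as one that is frequent inside it; with only two classes the scores are symmetric, so "goal" ranks first for both classes in this toy corpus. How the per-class term counts should be chosen is left to the caller here, since the record does not state the paper's rule.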