Feature Selection in Text Clustering Applications of Literary Texts: A Hybrid of Term Weighting Methods

Recent years have witnessed an increasing use of automated text clustering approaches, and more particularly Vector Space Clustering (VSC) methods, in the computational analysis of literary data, including genre classification, theme analysis, stylometry, and authorship attribution. In spite of the...

Bibliographic details
Published in: International journal of advanced computer science & applications 2020, Vol.11 (2)
Main author: Omar, Abdulfattah
Format: Article
Language: eng
Online access: Full text
container_end_page
container_issue 2
container_start_page
container_title International journal of advanced computer science & applications
container_volume 11
creator Omar, Abdulfattah
description Recent years have witnessed an increasing use of automated text clustering approaches, and more particularly Vector Space Clustering (VSC) methods, in the computational analysis of literary data, including genre classification, theme analysis, stylometry, and authorship attribution. Despite the effectiveness of VSC methods in resolving different problems in these disciplines and providing evidence-based research findings, feature selection remains a challenging problem. For reliable text clustering applications, a clustering structure should be based on all and only the most distinctive features within a corpus. Although different term weighting approaches have been developed, identifying the most distinctive variables within a corpus remains challenging, especially in document clustering applications of literary texts. To address this, the study proposes a hybrid of statistical measures, including variance analysis, term frequency-inverse document frequency (TF-IDF), and Principal Component Analysis (PCA), for selecting all and only the most distinctive features, so as to generate more reliable document clusterings for use in authorship attribution tasks. The study is based on a corpus of 74 novels written by 18 novelists representing different literary traditions. Results indicate that the proposed model was effective in extracting the most distinctive features within the datasets and thus in generating reliable clustering structures that can be used in different computational applications of literary texts.
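The description names three measures (variance analysis, TF-IDF, PCA) but this record does not say how the article combines them. The following is a minimal sketch of one plausible combination, not the author's procedure: weight terms with TF-IDF, keep the terms whose weights vary most across documents, then rank those by the size of their loading on the first principal component. The function names, the whitespace tokenizer, the cut-off parameters, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows); rows[d][t] is the TF-IDF weight of vocab[t] in doc d."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([
            (counts[t] / len(toks)) * math.log(n_docs / doc_freq[t])
            for t in vocab
        ])
    return vocab, rows


def column_variances(rows):
    """Population variance of each column (term) across documents."""
    n = len(rows)
    variances = []
    for col in zip(*rows):
        mean = sum(col) / n
        variances.append(sum((x - mean) ** 2 for x in col) / n)
    return variances


def first_component(rows, iterations=200):
    """Approximate the first principal component by power iteration."""
    n, m = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - mu for x, mu in zip(row, means)] for row in rows]
    vec = [float(j + 1) for j in range(m)]  # deterministic, non-uniform start
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        # Multiply by X^T X without forming the covariance matrix.
        scores = [sum(x * v for x, v in zip(row, vec)) for row in centered]
        new_vec = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in new_vec))
        if norm == 0.0:  # start vector orthogonal to the data: keep current vec
            break
        vec = [x / norm for x in new_vec]
    return vec


def select_features(docs, keep_by_variance, keep_by_pca):
    """Variance filter on TF-IDF weights, then rank survivors by |PC1 loading|."""
    vocab, rows = tfidf_matrix(docs)
    variances = column_variances(rows)
    top = sorted(range(len(vocab)), key=lambda j: variances[j], reverse=True)
    top = top[:keep_by_variance]
    reduced = [[row[j] for j in top] for row in rows]
    loadings = first_component(reduced)
    ranked = sorted(range(len(top)), key=lambda k: abs(loadings[k]), reverse=True)
    return [vocab[top[k]] for k in ranked[:keep_by_pca]]


# Toy documents (invented for illustration, not the article's corpus).
docs = [
    "the whale sea the ship",
    "the love letter the ball",
    "the whale the sea captain",
    "the love the marriage ball",
]
selected = select_features(docs, keep_by_variance=4, keep_by_pca=2)
print(selected)
```

In this toy input "the" occurs in every document, so its inverse document frequency is log(1) = 0 and the variance filter discards it. On a real corpus a library implementation of TF-IDF and PCA would normally replace these hand-written helpers.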
doi_str_mv 10.14569/IJACSA.2020.0110214
format Article
publisher West Yorkshire: Science and Information (SAI) Organization Limited
rights 2020. This work is licensed under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
eissn 2156-5570
oa free_for_read
fulltext fulltext
identifier ISSN: 2158-107X
ispartof International journal of advanced computer science & applications, 2020, Vol.11 (2)
issn 2158-107X
2156-5570
language eng
recordid cdi_proquest_journals_2655156060
source EZB-FREE-00999 freely available EZB journals
subjects Authorship
Clustering
Documents
Feature extraction
Feature selection
Frequency analysis
Principal components analysis
Texts
Variance analysis
Vector spaces
Weighting methods
title Feature Selection in Text Clustering Applications of Literary Texts: A Hybrid of Term Weighting Methods
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-01T14%3A42%3A22IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Feature%20Selection%20in%20Text%20Clustering%20Applications%20of%20Literary%20Texts:%20A%20Hybrid%20of%20Term%20Weighting%20Methods&rft.jtitle=International%20journal%20of%20advanced%20computer%20science%20&%20applications&rft.au=Omar,%20Abdulfattah&rft.date=2020&rft.volume=11&rft.issue=2&rft.issn=2158-107X&rft.eissn=2156-5570&rft_id=info:doi/10.14569/IJACSA.2020.0110214&rft_dat=%3Cproquest_cross%3E2655156060%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2655156060&rft_id=info:pmid/&rfr_iscdi=true