Feature Subset Selection for High-Dimensional, Low Sampling Size Data Classification Using Ensemble Feature Selection With a Wrapper-Based Search

The identification of suitable feature subsets from High-Dimensional Low-Sample-Size (HDLSS) data is of paramount importance because such datasets often contain numerous redundant and irrelevant features, leading to poor classification performance. However, the selection of an optimal feature subset...

Detailed Description

Saved in:
Bibliographic Details
Published in: IEEE Access 2024, Vol. 12, pp. 62341-62357
Main Authors: Mandal, Ashis Kumar; Nadim, MD; Saha, Hasi; Sultana, Tangina; Hossain, Md. Delowar; Huh, Eui-Nam
Format: Article
Language: English
Subjects:
Online Access: Full text
container_end_page 62357
container_issue
container_start_page 62341
container_title IEEE access
container_volume 12
creator Mandal, Ashis Kumar
Nadim, MD
Saha, Hasi
Sultana, Tangina
Hossain, Md. Delowar
Huh, Eui-Nam
description The identification of suitable feature subsets from High-Dimensional Low-Sample-Size (HDLSS) data is of paramount importance because such datasets often contain numerous redundant and irrelevant features, leading to poor classification performance. However, selecting an optimal feature subset from a vast feature space poses a significant computational challenge. In the HDLSS domain, conventional feature selection methods often struggle to balance reducing the number of features against preserving high classification accuracy. To address these issues, the study introduces an effective framework that employs a combined filter and wrapper-based strategy specifically designed for the classification challenges inherent in HDLSS data. The framework adopts a multi-step approach in which ensemble feature selection integrates five filter ranking approaches: Chi-square (\chi^{2}), Gini Index (GI), F-score, Mutual Information (MI), and Symmetric Uncertainty (SU) to identify the top-ranking features. In the subsequent stage, a wrapper-based search is applied, using the Differential Evolution (DE) metaheuristic algorithm as the search strategy. The fitness of candidate feature subsets during this search is assessed by a weighted combination of the Support Vector Machine (SVM) classifier's error rate and the ratio of feature cardinality. The dimensionality-reduced datasets are then used to build classification models with SVM, K-Nearest Neighbors (KNN), and Logistic Regression (LR). The approach was evaluated on 13 HDLSS datasets to assess its efficacy in selecting appropriate feature subsets and improving Classification Accuracy (ACC) along with Area Under the Curve (AUC). Results show that the proposed ensemble with wrapper-based search approach selects a small number of features (between 2 and 9 for all datasets) while maintaining commendable average AUC and ACC (between 98% and 100%). The comparative analysis reveals that the proposed method surpasses both ensemble feature selection and non-feature-selection approaches in terms of feature reduction and ACC. Additionally, when compared with various other state-of-the-art methods, the approach demonstrates commendable performance.
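The two-stage pipeline summarized above can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes scikit-learn and SciPy, uses only three of the five filters (those readily available in scikit-learn), and the weight alpha, the top-k cutoff, and the DE settings are illustrative assumptions, since the abstract does not give the exact values. Stage 1 averages filter rankings to keep the top-k features; stage 2 runs SciPy's Differential Evolution over a continuous [0,1] encoding, thresholded at 0.5, with a fitness that blends the SVM cross-validated error rate and the fraction of selected features:

import numpy as np
from scipy.optimize import differential_evolution
from scipy.stats import rankdata
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC


def ensemble_filter_topk(X, y, k=100):
    """Average per-filter rankings and return indices of the top-k features."""
    X_pos = MinMaxScaler().fit_transform(X)      # chi2 requires non-negative values
    scores = [chi2(X_pos, y)[0],
              f_classif(X, y)[0],
              mutual_info_classif(X, y, random_state=0)]
    # Higher score -> better rank (1 = best); average ranks across the filters.
    ranks = np.mean([rankdata(-s) for s in scores], axis=0)
    return np.argsort(ranks)[:k]


def de_wrapper_search(X, y, alpha=0.9, maxiter=30, seed=0):
    """DE over [0,1]^d; positions above 0.5 mark a selected feature."""
    d = X.shape[1]

    def fitness(pos):
        mask = pos > 0.5
        if not mask.any():                       # penalize the empty subset
            return 1.0
        acc = cross_val_score(SVC(kernel="linear"), X[:, mask], y, cv=5).mean()
        # Weighted combination of SVM error rate and feature-cardinality ratio.
        return alpha * (1.0 - acc) + (1.0 - alpha) * (mask.sum() / d)

    result = differential_evolution(fitness, bounds=[(0.0, 1.0)] * d,
                                    maxiter=maxiter, popsize=15,
                                    seed=seed, polish=False)
    return np.flatnonzero(result.x > 0.5)

Calling ensemble_filter_topk first and then de_wrapper_search on the reduced feature matrix mirrors the filter-then-wrapper order described in the abstract; the resulting feature mask would then feed the SVM, KNN, or LR classifiers used for evaluation.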
doi_str_mv 10.1109/ACCESS.2024.3390684
format Article
fulltext fulltext
identifier ISSN: 2169-3536
ispartof IEEE access, 2024, Vol.12, p.62341-62357
issn 2169-3536
2169-3536
language eng
recordid cdi_proquest_journals_3052194454
source IEEE Open Access Journals; DOAJ Directory of Open Access Journals; EZB-FREE-00999 freely available EZB journals
subjects Accuracy
Algorithms
Classification
Classification algorithms
Datasets
differential evolution
Feature extraction
Feature selection
filter approach
Filtering algorithms
HDLSS data
Heuristic methods
Information filters
Metaheuristics
Ranking
Search methods
Search problems
Support vector machines
wrapper approach
title Feature Subset Selection for High-Dimensional, Low Sampling Size Data Classification Using Ensemble Feature Selection With a Wrapper-Based Search
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-12T09%3A25%3A23IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Feature%20Subset%20Selection%20for%20High-Dimensional,%20Low%20Sampling%20Size%20Data%20Classification%20Using%20Ensemble%20Feature%20Selection%20With%20a%20Wrapper-Based%20Search&rft.jtitle=IEEE%20access&rft.au=Mandal,%20Ashis%20Kumar&rft.date=2024&rft.volume=12&rft.spage=62341&rft.epage=62357&rft.pages=62341-62357&rft.issn=2169-3536&rft.eissn=2169-3536&rft.coden=IAECCG&rft_id=info:doi/10.1109/ACCESS.2024.3390684&rft_dat=%3Cproquest_cross%3E3052194454%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3052194454&rft_id=info:pmid/&rft_ieee_id=10504829&rft_doaj_id=oai_doaj_org_article_06d6923e3a5e4dc0b17690f2c12390e0&rfr_iscdi=true