Stratified sampling for feature subspace selection in random forests for high dimensional data

For high dimensional data a large portion of features are often not informative of the class of the objects. Random forest algorithms tend to use a simple random sampling of features in building their decision trees and consequently select many subspaces that contain few, if any, informative features.

Full description

Bibliographic Details
Published in: Pattern recognition 2013-03, Vol.46 (3), p.769-787
Main authors: Ye, Yunming, Wu, Qingyao, Zhexue Huang, Joshua, Ng, Michael K., Li, Xutao
Format: Article
Language: English
Subjects:
Online access: Full text
container_end_page 787
container_issue 3
container_start_page 769
container_title Pattern recognition
container_volume 46
creator Ye, Yunming
Wu, Qingyao
Zhexue Huang, Joshua
Ng, Michael K.
Li, Xutao
description For high dimensional data a large portion of features are often not informative of the class of the objects. Random forest algorithms tend to use a simple random sampling of features in building their decision trees and consequently select many subspaces that contain few, if any, informative features. In this paper we propose a stratified sampling method to select the feature subspaces for random forests with high dimensional data. The key idea is to stratify features into two groups. One group will contain strong informative features and the other weak informative features. Then, for feature subspace selection, we randomly select features from each group proportionally. The advantage of stratified sampling is that we can ensure that each subspace contains enough informative features for classification in high dimensional data. Testing on both synthetic data and various real data sets from gene classification, image categorization and face recognition consistently demonstrates the effectiveness of this new method. The performance is shown to be better than that of state-of-the-art algorithms including SVM, the four variants of random forests (RF, ERT, enrich-RF, and oblique-RF), and nearest neighbor (NN) algorithms. ► Propose a stratified sampling method to select feature subspaces for random forest. ► Introduce a stratification variable to divide features into strong and weak groups. ► Select features from each group to ensure each subspace contains useful features. ► The new method increases the random forest strength and maintains the correlation. ► Extensive experiments demonstrated the effectiveness of the new method.
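The stratified subspace selection the abstract describes — score features, split them into strong and weak strata, then draw from each stratum proportionally — can be sketched roughly as follows. This is a minimal illustration, not the paper's exact procedure: the function name, the absolute feature-label correlation used as the stratification score, and the `threshold` default are all assumptions for the sketch (the paper defines its own stratification variable).

```python
# Hedged sketch of stratified feature-subspace sampling for a random
# forest. ASSUMPTION: informativeness is scored by |correlation| between
# each feature and the class label; the paper's stratification variable
# may differ.
import numpy as np

def stratified_subspace(X, y, subspace_size, threshold=0.1, rng=None):
    """Select a feature subspace by sampling proportionally from a
    'strong' and a 'weak' stratum of features."""
    rng = np.random.default_rng(rng)
    n_features = X.shape[1]
    # Score each feature by |correlation with the class label|.
    scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])
    scores = np.nan_to_num(scores)
    # Stratify: strong vs. weak informative features.
    strong = np.flatnonzero(scores >= threshold)
    weak = np.flatnonzero(scores < threshold)
    # Draw from each stratum proportionally to its size, guaranteeing
    # at least one strong feature when any exist.
    n_strong = min(len(strong),
                   max(1, round(subspace_size * len(strong) / n_features)))
    n_weak = min(subspace_size - n_strong, len(weak))
    return np.concatenate([
        rng.choice(strong, size=n_strong, replace=False),
        rng.choice(weak, size=n_weak, replace=False),
    ])
```

In a forest, one such subspace would be drawn per node (or per tree), so that even when most of the thousands of features are noise, every split still sees some informative candidates.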
doi_str_mv 10.1016/j.patcog.2012.09.005
format Article
fullrecord publisher Kidlington: Elsevier Ltd; coden PTNRA8; rights 2012 Elsevier Ltd; peer reviewed
fulltext fulltext
identifier ISSN: 0031-3203
ispartof Pattern recognition, 2013-03, Vol.46 (3), p.769-787
issn 0031-3203
1873-5142
language eng
recordid cdi_proquest_miscellaneous_1283661889
source Elsevier ScienceDirect Journals Complete
subjects Algorithms
Applied sciences
Classification
Decision trees
Detection, estimation, filtering, equalization, prediction
Ensemble classifier
Exact sciences and technology
Forests
High-dimensional data
Image processing
Information, signal and communications theory
Neural networks
Pattern recognition
Random forests
Sampling
Signal and communications theory
Signal processing
Signal representation. Spectral analysis
Signal, noise
Stratified sampling
Subspaces
Telecommunications and information theory
title Stratified sampling for feature subspace selection in random forests for high dimensional data
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-05T11%3A28%3A12IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Stratified%20sampling%20for%20feature%20subspace%20selection%20in%20random%20forests%20for%20high%20dimensional%20data&rft.jtitle=Pattern%20recognition&rft.au=Ye,%20Yunming&rft.date=2013-03-01&rft.volume=46&rft.issue=3&rft.spage=769&rft.epage=787&rft.pages=769-787&rft.issn=0031-3203&rft.eissn=1873-5142&rft.coden=PTNRA8&rft_id=info:doi/10.1016/j.patcog.2012.09.005&rft_dat=%3Cproquest_cross%3E1283661889%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1283661889&rft_id=info:pmid/&rft_els_id=S0031320312003974&rfr_iscdi=true