Stratified sampling for feature subspace selection in random forests for high dimensional data
For high dimensional data a large portion of features are often not informative of the class of the objects. Random forest algorithms tend to use a simple random sampling of features in building their decision trees and consequently select many subspaces that contain few, if any, informative features.
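The subspace selection described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the informativeness score, the threshold, and the at-least-one-strong-feature rule are assumptions introduced here for the sketch.

```python
import random

def stratified_subspace(scores, subspace_size, threshold, rng=None):
    """Draw one feature subspace by stratified sampling.

    scores maps feature index -> informativeness score (e.g. information
    gain or a chi-square statistic; the concrete measure is an assumption).
    Features with score >= threshold form the "strong" stratum, the rest
    the "weak" stratum; each stratum contributes features roughly in
    proportion to its size, with at least one strong feature whenever
    any exist, so every subspace carries some informative features.
    """
    rng = rng or random.Random(0)
    strong = [f for f, s in scores.items() if s >= threshold]
    weak = [f for f, s in scores.items() if s < threshold]
    # Proportional allocation across the two strata.
    n_strong = round(subspace_size * len(strong) / len(scores))
    if strong:
        n_strong = min(max(n_strong, 1), len(strong), subspace_size)
    else:
        n_strong = 0
    n_weak = min(subspace_size - n_strong, len(weak))
    return rng.sample(strong, n_strong) + rng.sample(weak, n_weak)

# Hypothetical scores: features 0-4 informative, 5-19 noise.
scores = {i: (0.9 if i < 5 else 0.05) for i in range(20)}
subspace = stratified_subspace(scores, subspace_size=8, threshold=0.5)
```

With simple random sampling over 20 features of which only 5 are informative, many 8-feature subspaces would contain no informative feature at all; the stratified draw above guarantees at least one while keeping the overall mix close to proportional.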
Saved in:
Published in: | Pattern recognition 2013-03, Vol.46 (3), p.769-787 |
---|---|
Main authors: | Ye, Yunming ; Wu, Qingyao ; Zhexue Huang, Joshua ; Ng, Michael K. ; Li, Xutao |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Full text |
container_end_page | 787 |
---|---|
container_issue | 3 |
container_start_page | 769 |
container_title | Pattern recognition |
container_volume | 46 |
creator | Ye, Yunming ; Wu, Qingyao ; Zhexue Huang, Joshua ; Ng, Michael K. ; Li, Xutao |
description | For high dimensional data a large portion of features are often not informative of the class of the objects. Random forest algorithms tend to use a simple random sampling of features in building their decision trees and consequently select many subspaces that contain few, if any, informative features. In this paper we propose a stratified sampling method to select the feature subspaces for random forests with high dimensional data. The key idea is to stratify features into two groups. One group will contain strong informative features and the other weak informative features. Then, for feature subspace selection, we randomly select features from each group proportionally. The advantage of stratified sampling is that we can ensure that each subspace contains enough informative features for classification in high dimensional data. Testing on both synthetic data and various real data sets in gene classification, image categorization and face recognition data sets consistently demonstrates the effectiveness of this new method. The performance is shown to be better than that of state-of-the-art algorithms including SVM, the four variants of random forests (RF, ERT, enrich-RF, and oblique-RF), and nearest neighbor (NN) algorithms. |
► Propose a stratified sampling method to select feature subspaces for random forest. ► Introduce a stratification variable to divide features into strong and weak groups. ► Select features from each group to ensure each subspace contains useful features. ► The new method increases the random forest strength and maintains the correlation. ► Extensive experiments demonstrated the effectiveness of the new method. |
doi_str_mv | 10.1016/j.patcog.2012.09.005 |
format | Article |
coden | PTNRA8 |
publisher | Kidlington: Elsevier Ltd |
rights | 2012 Elsevier Ltd ; 2014 INIST-CNRS |
fulltext | fulltext |
identifier | ISSN: 0031-3203 |
ispartof | Pattern recognition, 2013-03, Vol.46 (3), p.769-787 |
issn | 0031-3203 ; 1873-5142 |
language | eng |
recordid | cdi_proquest_miscellaneous_1283661889 |
source | Elsevier ScienceDirect Journals Complete |
subjects | Algorithms ; Applied sciences ; Classification ; Decision trees ; Detection, estimation, filtering, equalization, prediction ; Ensemble classifier ; Exact sciences and technology ; Forests ; High-dimensional data ; Image processing ; Information, signal and communications theory ; Neural networks ; Pattern recognition ; Random forests ; Sampling ; Signal and communications theory ; Signal processing ; Signal representation. Spectral analysis ; Signal, noise ; Stratified sampling ; Subspaces ; Telecommunications and information theory |
title | Stratified sampling for feature subspace selection in random forests for high dimensional data |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-05T11%3A28%3A12IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Stratified%20sampling%20for%20feature%20subspace%20selection%20in%20random%20forests%20for%20high%20dimensional%20data&rft.jtitle=Pattern%20recognition&rft.au=Ye,%20Yunming&rft.date=2013-03-01&rft.volume=46&rft.issue=3&rft.spage=769&rft.epage=787&rft.pages=769-787&rft.issn=0031-3203&rft.eissn=1873-5142&rft.coden=PTNRA8&rft_id=info:doi/10.1016/j.patcog.2012.09.005&rft_dat=%3Cproquest_cross%3E1283661889%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1283661889&rft_id=info:pmid/&rft_els_id=S0031320312003974&rfr_iscdi=true |