Stratified sampling for feature subspace selection in random forests for high dimensional data

For high dimensional data a large portion of features are often not informative of the class of the objects. Random forest algorithms tend to use a simple random sampling of features in building their decision trees and consequently select many subspaces that contain few, if any, informative features.

Full description

Bibliographic Details
Published in: Pattern recognition 2013-03, Vol.46 (3), p.769-787
Main authors: Ye, Yunming, Wu, Qingyao, Zhexue Huang, Joshua, Ng, Michael K., Li, Xutao
Format: Article
Language: English
Subjects:
Online access: Full text
container_end_page 787
container_issue 3
container_start_page 769
container_title Pattern recognition
container_volume 46
creator Ye, Yunming
Wu, Qingyao
Zhexue Huang, Joshua
Ng, Michael K.
Li, Xutao
description For high dimensional data a large portion of features are often not informative of the class of the objects. Random forest algorithms tend to use a simple random sampling of features in building their decision trees and consequently select many subspaces that contain few, if any, informative features. In this paper we propose a stratified sampling method to select the feature subspaces for random forests with high dimensional data. The key idea is to stratify features into two groups. One group will contain strong informative features and the other weak informative features. Then, for feature subspace selection, we randomly select features from each group proportionally. The advantage of stratified sampling is that we can ensure that each subspace contains enough informative features for classification in high dimensional data. Testing on both synthetic data and various real data sets from gene classification, image categorization and face recognition consistently demonstrates the effectiveness of this new method. The performance is shown to be better than that of state-of-the-art algorithms including SVM, the four variants of random forests (RF, ERT, enrich-RF, and oblique-RF), and nearest neighbor (NN) algorithms. ► Propose a stratified sampling method to select feature subspaces for random forest. ► Introduce a stratification variable to divide features into strong and weak groups. ► Select features from each group to ensure each subspace contains useful features. ► The new method increases the random forest strength and maintains the correlation. ► Extensive experiments demonstrated the effectiveness of the new method.
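The stratified subspace selection the abstract describes — score features, split them into strong and weak strata, then draw from each stratum proportionally — can be sketched roughly as follows. This is a minimal illustration, not the paper's exact procedure: the function name, the absolute feature-label correlation used as the stratification score, and the `threshold` default are all assumptions for the sketch (the paper defines its own stratification variable).

```python
# Hedged sketch of stratified feature-subspace sampling for a random
# forest. ASSUMPTION: informativeness is scored by |correlation| between
# each feature and the class label; the paper's stratification variable
# may differ.
import numpy as np

def stratified_subspace(X, y, subspace_size, threshold=0.1, rng=None):
    """Select a feature subspace by sampling proportionally from a
    'strong' and a 'weak' stratum of features."""
    rng = np.random.default_rng(rng)
    n_features = X.shape[1]
    # Score each feature by |correlation with the class label|.
    scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])
    scores = np.nan_to_num(scores)
    # Stratify: strong vs. weak informative features.
    strong = np.flatnonzero(scores >= threshold)
    weak = np.flatnonzero(scores < threshold)
    # Draw from each stratum proportionally to its size, guaranteeing
    # at least one strong feature when any exist.
    n_strong = min(len(strong),
                   max(1, round(subspace_size * len(strong) / n_features)))
    n_weak = min(subspace_size - n_strong, len(weak))
    return np.concatenate([
        rng.choice(strong, size=n_strong, replace=False),
        rng.choice(weak, size=n_weak, replace=False),
    ])
```

In a forest, one such subspace would be drawn per node (or per tree), so that even when most of the thousands of features are noise, every split still sees some informative candidates.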
doi_str_mv 10.1016/j.patcog.2012.09.005
format Article
fullrecord publisher Kidlington: Elsevier Ltd; coden PTNRA8; rights 2012 Elsevier Ltd; peer reviewed
fulltext fulltext
identifier ISSN: 0031-3203
ispartof Pattern recognition, 2013-03, Vol.46 (3), p.769-787
issn 0031-3203
1873-5142
language eng
recordid cdi_proquest_miscellaneous_1283661889
source Elsevier ScienceDirect Journals Complete
subjects Algorithms
Applied sciences
Classification
Decision trees
Detection, estimation, filtering, equalization, prediction
Ensemble classifier
Exact sciences and technology
Forests
High-dimensional data
Image processing
Information, signal and communications theory
Neural networks
Pattern recognition
Random forests
Sampling
Signal and communications theory
Signal processing
Signal representation. Spectral analysis
Signal, noise
Stratified sampling
Subspaces
Telecommunications and information theory
title Stratified sampling for feature subspace selection in random forests for high dimensional data
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-05T11%3A28%3A12IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Stratified%20sampling%20for%20feature%20subspace%20selection%20in%20random%20forests%20for%20high%20dimensional%20data&rft.jtitle=Pattern%20recognition&rft.au=Ye,%20Yunming&rft.date=2013-03-01&rft.volume=46&rft.issue=3&rft.spage=769&rft.epage=787&rft.pages=769-787&rft.issn=0031-3203&rft.eissn=1873-5142&rft.coden=PTNRA8&rft_id=info:doi/10.1016/j.patcog.2012.09.005&rft_dat=%3Cproquest_cross%3E1283661889%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1283661889&rft_id=info:pmid/&rft_els_id=S0031320312003974&rfr_iscdi=true