Mining Big Data with Random Forests
In the current big data era, naive implementations of well-known learning algorithms cannot efficiently and effectively deal with large datasets. Random forests (RFs) are a popular ensemble-based method for classification. RFs have been shown to be effective in many different real-world classification problems and are commonly considered one of the best learning algorithms in this context.
Saved in:
Published in: | Cognitive computation 2019-04, Vol.11 (2), p.294-316 |
---|---|
Main authors: | Lulli, Alessandro; Oneto, Luca; Anguita, Davide |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 316 |
---|---|
container_issue | 2 |
container_start_page | 294 |
container_title | Cognitive computation |
container_volume | 11 |
creator | Lulli, Alessandro; Oneto, Luca; Anguita, Davide |
description | In the current big data era, naive implementations of well-known learning algorithms cannot efficiently and effectively deal with large datasets. Random forests (RFs) are a popular ensemble-based method for classification. RFs have been shown to be effective in many different real-world classification problems and are commonly considered one of the best learning algorithms in this context. In this paper, we develop an RF implementation called ReForeSt, which, unlike the currently available solutions, can distribute data on available machines in two different ways to optimize the computational and memory requirements of RF with arbitrarily large datasets ranging from millions of samples to millions of features. A recently proposed improved RF formulation called random rotation ensembles can be used in conjunction with model selection to automatically tune the RF hyperparameters. We perform an extensive experimental evaluation on a wide range of large datasets and several environments with different numbers of machines and numbers of cores per machine. Results demonstrate that ReForeSt, in comparison to other state-of-the-art alternatives such as MLlib, is less computationally intensive, more memory efficient, and more effective. |
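The abstract above names two concrete techniques: automatic tuning of the RF hyperparameters via model selection, and the "random rotation ensembles" RF variant. The following is a minimal sketch of both ideas, not the paper's ReForeSt implementation (which targets distributed Spark clusters); it uses scikit-learn as a stand-in, and the dataset, parameter grid, and seeds are illustrative assumptions.

```python
# Sketch only: single-machine scikit-learn stand-in for the two ideas
# in the abstract (hyperparameter tuning + random rotation), NOT the
# distributed ReForeSt implementation described in the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# (1) Model selection: tune RF hyperparameters with cross-validation,
# analogous to the automatic tuning the abstract describes.
grid = {"n_estimators": [50, 100], "max_features": ["sqrt", "log2"]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=3)
search.fit(X_tr, y_tr)
acc = search.score(X_te, y_te)

# (2) Random rotation: draw a random orthogonal matrix (QR of a Gaussian
# matrix, so strictly a rotation or reflection) and fit the forest in the
# rotated feature space. True random rotation *ensembles* draw one
# rotation per tree; a single shared rotation keeps the sketch short.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(20, 20)))
rot = RandomForestClassifier(n_estimators=100, random_state=0)
rot.fit(X_tr @ Q, y_tr)
rot_acc = rot.score(X_te @ Q, y_te)
print(search.best_params_, round(acc, 3), round(rot_acc, 3))
```

In the paper's setting the same two steps run over data partitioned across machines (the role MLlib plays in the baseline comparison), but the modeling ideas are the ones shown here.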
doi_str_mv | 10.1007/s12559-018-9615-4 |
format | Article |
fullrecord | Article: Mining Big Data with Random Forests. Lulli, Alessandro; Oneto, Luca; Anguita, Davide (ORCID 0000-0002-8445-395X). Cognitive computation (Cogn Comput), 2019-04-01, Vol.11 (2), p.294-316 (23 pages). New York: Springer US. ISSN 1866-9956; EISSN 1866-9964. DOI 10.1007/s12559-018-9615-4. Peer reviewed. Rights: Springer Science+Business Media, LLC, part of Springer Nature 2019. Sources: ProQuest Central UK/Ireland; SpringerLink Journals - AutoHoldings; ProQuest Central. |
fulltext | fulltext |
identifier | ISSN: 1866-9956 |
ispartof | Cognitive computation, 2019-04, Vol.11 (2), p.294-316 |
issn | 1866-9956; 1866-9964 |
language | eng |
recordid | cdi_proquest_journals_2919516615 |
source | ProQuest Central UK/Ireland; SpringerLink Journals - AutoHoldings; ProQuest Central |
subjects | Accuracy; Algorithms; Artificial Intelligence; Big Data; Biomedical and Life Sciences; Biomedicine; Classification; Computation by Abstract Devices; Computational Biology/Bioinformatics; Datasets; Feature selection; Machine learning; Memory; Neurosciences; Reforestation; Trees |
title | Mining Big Data with Random Forests |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-24T13%3A26%3A57IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Mining%20Big%20Data%20with%20Random%20Forests&rft.jtitle=Cognitive%20computation&rft.au=Lulli,%20Alessandro&rft.date=2019-04-01&rft.volume=11&rft.issue=2&rft.spage=294&rft.epage=316&rft.pages=294-316&rft.issn=1866-9956&rft.eissn=1866-9964&rft_id=info:doi/10.1007/s12559-018-9615-4&rft_dat=%3Cproquest_cross%3E2919516615%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2919516615&rft_id=info:pmid/&rfr_iscdi=true |