Mining Big Data with Random Forests

In the current big data era, naive implementations of well-known learning algorithms cannot efficiently and effectively deal with large datasets. Random forests (RFs) are a popular ensemble-based method for classification. RFs have been shown to be effective in many different real-world classification problems and are commonly considered one of the best learning algorithms in this context. In this paper, we develop an RF implementation called ReForeSt, which, unlike the currently available solutions, can distribute data on available machines in two different ways to optimize the computational and memory requirements of RF with arbitrarily large datasets ranging from millions of samples to millions of features. A recently proposed improved RF formulation called random rotation ensembles can be used in conjunction with model selection to automatically tune the RF hyperparameters. We perform an extensive experimental evaluation on a wide range of large datasets and several environments with different numbers of machines and numbers of cores per machine. Results demonstrate that ReForeSt, in comparison to other state-of-the-art alternatives such as MLlib, is less computationally intensive, more memory efficient, and more effective.
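The random rotation ensembles mentioned in the abstract work by giving each tree in the forest its own randomly rotated view of the feature space before training. The following is a minimal illustrative NumPy sketch of the rotation step only — it is not the authors' ReForeSt implementation, and the function name `random_rotation` and all sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d, rng):
    """Draw a random d x d rotation matrix (orthogonal, determinant +1)."""
    # QR decomposition of a Gaussian matrix yields a random orthogonal Q.
    A = rng.standard_normal((d, d))
    Q, R = np.linalg.qr(A)
    # Fix column signs (using the diagonal of R) so Q is uniformly
    # distributed over orthogonal matrices.
    Q *= np.sign(np.diag(R))
    # Flip one column if needed so det(Q) = +1, i.e. a proper rotation.
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]
    return Q

# Each ensemble member sees the training data through its own rotation.
X = rng.standard_normal((100, 5))           # toy data: 100 samples, 5 features
rotations = [random_rotation(5, rng) for _ in range(10)]
rotated_views = [X @ Q for Q in rotations]  # one rotated copy per tree
```

In a full random rotation forest, each base tree would be trained on one `rotated_views[i]`, and a test point would be multiplied by the same matrix before being passed to that tree for prediction.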

Detailed Description

Bibliographic Details
Published in: Cognitive computation, 2019-04, Vol. 11 (2), pp. 294-316
Main authors: Lulli, Alessandro; Oneto, Luca; Anguita, Davide
Format: Article
Language: English
Online access: Full text
DOI: 10.1007/s12559-018-9615-4
ISSN: 1866-9956
EISSN: 1866-9964
Subjects: Accuracy; Algorithms; Artificial Intelligence; Big Data; Biomedical and Life Sciences; Biomedicine; Classification; Computation by Abstract Devices; Computational Biology/Bioinformatics; Datasets; Feature selection; Machine learning; Memory; Neurosciences; Reforestation; Trees