Mining Big Data with Random Forests

In the current big data era, naive implementations of well-known learning algorithms cannot efficiently and effectively deal with large datasets. Random forests (RFs) are a popular ensemble-based method for classification. RFs have been shown to be effective in many different real-world classification problems and are commonly considered one of the best learning algorithms in this context. In this paper, we develop an RF implementation called ReForeSt, which, unlike the currently available solutions, can distribute data on available machines in two different ways to optimize the computational and memory requirements of RF with arbitrarily large datasets ranging from millions of samples to millions of features. A recently proposed improved RF formulation called random rotation ensembles can be used in conjunction with model selection to automatically tune the RF hyperparameters. We perform an extensive experimental evaluation on a wide range of large datasets and several environments with different numbers of machines and numbers of cores per machine. Results demonstrate that ReForeSt, in comparison to other state-of-the-art alternatives such as MLlib, is less computationally intensive, more memory efficient, and more effective.
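The random rotation ensembles mentioned in the abstract work by giving each tree in the forest its own randomly rotated view of the feature space before training. The following is a minimal illustrative NumPy sketch of the rotation step only — it is not the authors' ReForeSt implementation, and the function name `random_rotation` and all sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d, rng):
    """Draw a random d x d rotation matrix (orthogonal, determinant +1)."""
    # QR decomposition of a Gaussian matrix yields a random orthogonal Q.
    A = rng.standard_normal((d, d))
    Q, R = np.linalg.qr(A)
    # Fix column signs (using the diagonal of R) so Q is uniformly
    # distributed over orthogonal matrices.
    Q *= np.sign(np.diag(R))
    # Flip one column if needed so det(Q) = +1, i.e. a proper rotation.
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]
    return Q

# Each ensemble member sees the training data through its own rotation.
X = rng.standard_normal((100, 5))           # toy data: 100 samples, 5 features
rotations = [random_rotation(5, rng) for _ in range(10)]
rotated_views = [X @ Q for Q in rotations]  # one rotated copy per tree
```

In a full random rotation forest, each base tree would be trained on one `rotated_views[i]`, and a test point would be multiplied by the same matrix before being passed to that tree for prediction.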

Detailed Description

Bibliographic Details
Published in: Cognitive computation, 2019-04, Vol. 11 (2), pp. 294-316
Main authors: Lulli, Alessandro; Oneto, Luca; Anguita, Davide
Format: Article
Language: English
Online access: Full text
DOI: 10.1007/s12559-018-9615-4
ISSN: 1866-9956
EISSN: 1866-9964
Subjects: Accuracy; Algorithms; Artificial Intelligence; Big Data; Biomedical and Life Sciences; Biomedicine; Classification; Computation by Abstract Devices; Computational Biology/Bioinformatics; Datasets; Feature selection; Machine learning; Memory; Neurosciences; Reforestation; Trees