An Information Theory-Based Feature Selection Framework for Big Data Under Apache Spark

With the advent of extremely high dimensional datasets, dimensionality reduction techniques are becoming mandatory. Of the many techniques available, feature selection (FS) is of growing interest for its ability to identify both relevant features and frequently repeated instances in huge datasets. W...

Bibliographic details
Published in:	IEEE transactions on systems, man, and cybernetics. Systems, 2018-09, Vol.48 (9), p.1441-1453
Main authors:	Ramirez-Gallego, Sergio ; Mourino-Talin, Hector ; Martinez-Rego, David ; Bolon-Canedo, Veronica ; Benitez, Jose Manuel ; Alonso-Betanzos, Amparo ; Herrera, Francisco
Format:	Article
Language:	eng
Subjects:	Apache spark ; Big Data ; Data management ; Data mining ; Datasets ; Distributed databases ; Feature extraction ; feature selection (FS) ; filtering methods ; high-dimensional ; Information theory ; Programming ; Sparks
Online access:	Order full text

container_end_page	1453
container_issue	9
container_start_page	1441
container_title	IEEE transactions on systems, man, and cybernetics. Systems
container_volume	48
creator	Ramirez-Gallego, Sergio ; Mourino-Talin, Hector ; Martinez-Rego, David ; Bolon-Canedo, Veronica ; Benitez, Jose Manuel ; Alonso-Betanzos, Amparo ; Herrera, Francisco
description	With the advent of extremely high dimensional datasets, dimensionality reduction techniques are becoming mandatory. Of the many techniques available, feature selection (FS) is of growing interest for its ability to identify both relevant features and frequently repeated instances in huge datasets. We aim to demonstrate that standard FS methods can be parallelized in big data platforms like Apache Spark so as to boost both performance and accuracy. We propose a distributed implementation of a generic FS framework that includes a broad group of well-known information theory-based methods. Experimental results for a broad set of real-world datasets show that our distributed framework is capable of rapidly dealing with ultrahigh-dimensional datasets as well as those with a huge number of samples, outperforming the sequential version in all the cases studied.
doi_str_mv	10.1109/TSMC.2017.2670926
format	Article
fullrecord	<record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_ieee_primary_7970198</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>7970198</ieee_id><sourcerecordid>2117132006</sourcerecordid><originalsourceid>FETCH-LOGICAL-c293t-4ec202fcb46a4d4ecf9906f85c5ef720ea7c8f45c96409a5719cb8054a350f843</originalsourceid><addsrcrecordid>eNo9kE1PAjEQhhujiQT5AcZLE8-L03a33R4BRUkwHoB4bEqZyvKxXbtLDP_eRYinmck870zyEHLPoM8Y6Kf57H3U58BUn0sFmssr0uFM5gnngl__90zekl5dbwCA8VwKkB3yOSjppPQh7m1ThJLO1xjiMRnaGld0jLY5RKQz3KH7W4-j3eNPiFvaRuiw-KLPtrF0Ua4w0kFl3bqlKxu3d-TG212NvUvtksX4ZT56S6Yfr5PRYJo4rkWTpOg4cO-WqbTpqp281iB9nrkMveKAVrncp5nTMgVtM8W0W-aQpVZk4PNUdMnj-W4Vw_cB68ZswiGW7UvDGVNMcADZUuxMuRjqOqI3VSz2Nh4NA3NSaE4KzUmhuShsMw_nTIGI_7zSCpjOxS95pmuS</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2117132006</pqid></control><display><type>article</type><title>An Information Theory-Based Feature Selection Framework for Big Data Under Apache Spark</title><source>IEEE Electronic Library (IEL)</source><creator>Ramirez-Gallego, Sergio ; Mourino-Talin, Hector ; Martinez-Rego, David ; Bolon-Canedo, Veronica ; Benitez, Jose Manuel ; Alonso-Betanzos, Amparo ; Herrera, Francisco</creator><creatorcontrib>Ramirez-Gallego, Sergio ; Mourino-Talin, Hector ; Martinez-Rego, David ; Bolon-Canedo, Veronica ; Benitez, Jose Manuel ; Alonso-Betanzos, Amparo ; Herrera, Francisco</creatorcontrib><description>With the advent of extremely high dimensional datasets, dimensionality reduction techniques are becoming mandatory. Of the many techniques available, feature selection (FS) is of growing interest for its ability to identify both relevant features and frequently repeated instances in huge datasets. We aim to demonstrate that standard FS methods can be parallelized in big data platforms like Apache Spark so as to boost both performance and accuracy. 
We propose a distributed implementation of a generic FS framework that includes a broad group of well-known information theory-based methods. Experimental results for a broad set of real-world datasets show that our distributed framework is capable of rapidly dealing with ultrahigh-dimensional datasets as well as those with a huge number of samples, outperforming the sequential version in all the cases studied.</description><identifier>ISSN: 2168-2216</identifier><identifier>EISSN: 2168-2232</identifier><identifier>DOI: 10.1109/TSMC.2017.2670926</identifier><identifier>CODEN: ITSMFE</identifier><language>eng</language><publisher>New York: IEEE</publisher><subject>Apache spark ; Big Data ; Data management ; Data mining ; Datasets ; Distributed databases ; Feature extraction ; feature selection (FS) ; filtering methods ; high-dimensional ; Information theory ; Programming ; Sparks</subject><ispartof>IEEE transactions on systems, man, and cybernetics. Systems, 2018-09, Vol.48 (9), p.1441-1453</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) 2018</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c293t-4ec202fcb46a4d4ecf9906f85c5ef720ea7c8f45c96409a5719cb8054a350f843</citedby><cites>FETCH-LOGICAL-c293t-4ec202fcb46a4d4ecf9906f85c5ef720ea7c8f45c96409a5719cb8054a350f843</cites><orcidid>0000-0003-4804-5884 ; 0000-0003-0950-0012 ; 0000-0002-7283-312X</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/7970198$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,777,781,793,27905,27906,54739</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/7970198$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Ramirez-Gallego, Sergio</creatorcontrib><creatorcontrib>Mourino-Talin, Hector</creatorcontrib><creatorcontrib>Martinez-Rego, David</creatorcontrib><creatorcontrib>Bolon-Canedo, Veronica</creatorcontrib><creatorcontrib>Benitez, Jose Manuel</creatorcontrib><creatorcontrib>Alonso-Betanzos, Amparo</creatorcontrib><creatorcontrib>Herrera, Francisco</creatorcontrib><title>An Information Theory-Based Feature Selection Framework for Big Data Under Apache Spark</title><title>IEEE transactions on systems, man, and cybernetics. Systems</title><addtitle>TSMC</addtitle><description>With the advent of extremely high dimensional datasets, dimensionality reduction techniques are becoming mandatory. Of the many techniques available, feature selection (FS) is of growing interest for its ability to identify both relevant features and frequently repeated instances in huge datasets. We aim to demonstrate that standard FS methods can be parallelized in big data platforms like Apache Spark so as to boost both performance and accuracy. 
We propose a distributed implementation of a generic FS framework that includes a broad group of well-known information theory-based methods. Experimental results for a broad set of real-world datasets show that our distributed framework is capable of rapidly dealing with ultrahigh-dimensional datasets as well as those with a huge number of samples, outperforming the sequential version in all the cases studied.</description><subject>Apache spark</subject><subject>Big Data</subject><subject>Data management</subject><subject>Data mining</subject><subject>Datasets</subject><subject>Distributed databases</subject><subject>Feature extraction</subject><subject>feature selection (FS)</subject><subject>filtering methods</subject><subject>high-dimensional</subject><subject>Information theory</subject><subject>Programming</subject><subject>Sparks</subject><issn>2168-2216</issn><issn>2168-2232</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2018</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNo9kE1PAjEQhhujiQT5AcZLE8-L03a33R4BRUkwHoB4bEqZyvKxXbtLDP_eRYinmck870zyEHLPoM8Y6Kf57H3U58BUn0sFmssr0uFM5gnngl__90zekl5dbwCA8VwKkB3yOSjppPQh7m1ThJLO1xjiMRnaGld0jLY5RKQz3KH7W4-j3eNPiFvaRuiw-KLPtrF0Ua4w0kFl3bqlKxu3d-TG212NvUvtksX4ZT56S6Yfr5PRYJo4rkWTpOg4cO-WqbTpqp281iB9nrkMveKAVrncp5nTMgVtM8W0W-aQpVZk4PNUdMnj-W4Vw_cB68ZswiGW7UvDGVNMcADZUuxMuRjqOqI3VSz2Nh4NA3NSaE4KzUmhuShsMw_nTIGI_7zSCpjOxS95pmuS</recordid><startdate>20180901</startdate><enddate>20180901</enddate><creator>Ramirez-Gallego, Sergio</creator><creator>Mourino-Talin, Hector</creator><creator>Martinez-Rego, David</creator><creator>Bolon-Canedo, Veronica</creator><creator>Benitez, Jose Manuel</creator><creator>Alonso-Betanzos, Amparo</creator><creator>Herrera, Francisco</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>7TB</scope><scope>8FD</scope><scope>FR3</scope><scope>H8D</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><orcidid>https://orcid.org/0000-0003-4804-5884</orcidid><orcidid>https://orcid.org/0000-0003-0950-0012</orcidid><orcidid>https://orcid.org/0000-0002-7283-312X</orcidid></search><sort><creationdate>20180901</creationdate><title>An Information Theory-Based Feature Selection Framework for Big Data Under Apache Spark</title><author>Ramirez-Gallego, Sergio ; Mourino-Talin, Hector ; Martinez-Rego, David ; Bolon-Canedo, Veronica ; Benitez, Jose Manuel ; Alonso-Betanzos, Amparo ; Herrera, Francisco</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c293t-4ec202fcb46a4d4ecf9906f85c5ef720ea7c8f45c96409a5719cb8054a350f843</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2018</creationdate><topic>Apache spark</topic><topic>Big Data</topic><topic>Data management</topic><topic>Data mining</topic><topic>Datasets</topic><topic>Distributed databases</topic><topic>Feature extraction</topic><topic>feature selection (FS)</topic><topic>filtering methods</topic><topic>high-dimensional</topic><topic>Information theory</topic><topic>Programming</topic><topic>Sparks</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Ramirez-Gallego, Sergio</creatorcontrib><creatorcontrib>Mourino-Talin, Hector</creatorcontrib><creatorcontrib>Martinez-Rego, David</creatorcontrib><creatorcontrib>Bolon-Canedo, Veronica</creatorcontrib><creatorcontrib>Benitez, Jose Manuel</creatorcontrib><creatorcontrib>Alonso-Betanzos, Amparo</creatorcontrib><creatorcontrib>Herrera, Francisco</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE 
All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics &amp; Communications Abstracts</collection><collection>Mechanical &amp; Transportation Engineering Abstracts</collection><collection>Technology Research Database</collection><collection>Engineering Research Database</collection><collection>Aerospace Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>IEEE transactions on systems, man, and cybernetics. Systems</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Ramirez-Gallego, Sergio</au><au>Mourino-Talin, Hector</au><au>Martinez-Rego, David</au><au>Bolon-Canedo, Veronica</au><au>Benitez, Jose Manuel</au><au>Alonso-Betanzos, Amparo</au><au>Herrera, Francisco</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>An Information Theory-Based Feature Selection Framework for Big Data Under Apache Spark</atitle><jtitle>IEEE transactions on systems, man, and cybernetics. Systems</jtitle><stitle>TSMC</stitle><date>2018-09-01</date><risdate>2018</risdate><volume>48</volume><issue>9</issue><spage>1441</spage><epage>1453</epage><pages>1441-1453</pages><issn>2168-2216</issn><eissn>2168-2232</eissn><coden>ITSMFE</coden><abstract>With the advent of extremely high dimensional datasets, dimensionality reduction techniques are becoming mandatory. Of the many techniques available, feature selection (FS) is of growing interest for its ability to identify both relevant features and frequently repeated instances in huge datasets. 
We aim to demonstrate that standard FS methods can be parallelized in big data platforms like Apache Spark so as to boost both performance and accuracy. We propose a distributed implementation of a generic FS framework that includes a broad group of well-known information theory-based methods. Experimental results for a broad set of real-world datasets show that our distributed framework is capable of rapidly dealing with ultrahigh-dimensional datasets as well as those with a huge number of samples, outperforming the sequential version in all the cases studied.</abstract><cop>New York</cop><pub>IEEE</pub><doi>10.1109/TSMC.2017.2670926</doi><tpages>13</tpages><orcidid>https://orcid.org/0000-0003-4804-5884</orcidid><orcidid>https://orcid.org/0000-0003-0950-0012</orcidid><orcidid>https://orcid.org/0000-0002-7283-312X</orcidid></addata></record>
fulltext	fulltext_linktorsrc
identifier	ISSN: 2168-2216
ispartof	IEEE transactions on systems, man, and cybernetics. Systems, 2018-09, Vol.48 (9), p.1441-1453
issn	2168-2216 2168-2232
language	eng
recordid	cdi_ieee_primary_7970198
source	IEEE Electronic Library (IEL)
subjects	Apache spark ; Big Data ; Data management ; Data mining ; Datasets ; Distributed databases ; Feature extraction ; feature selection (FS) ; filtering methods ; high-dimensional ; Information theory ; Programming ; Sparks
title	An Information Theory-Based Feature Selection Framework for Big Data Under Apache Spark
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-19T02%3A17%3A29IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=An%20Information%20Theory-Based%20Feature%20Selection%20Framework%20for%20Big%20Data%20Under%20Apache%20Spark&rft.jtitle=IEEE%20transactions%20on%20systems,%20man,%20and%20cybernetics.%20Systems&rft.au=Ramirez-Gallego,%20Sergio&rft.date=2018-09-01&rft.volume=48&rft.issue=9&rft.spage=1441&rft.epage=1453&rft.pages=1441-1453&rft.issn=2168-2216&rft.eissn=2168-2232&rft.coden=ITSMFE&rft_id=info:doi/10.1109/TSMC.2017.2670926&rft_dat=%3Cproquest_RIE%3E2117132006%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2117132006&rft_id=info:pmid/&rft_ieee_id=7970198&rfr_iscdi=true