A Comparative Analysis of Hadoop and Spark Frameworks using Word Count Algorithm

With the advent of the Big Data explosion due to the Information Technology (IT) revolution during the last few decades, the need for processing and analyzing the data at low cost in minimum time has become immensely challenging. The field of Big Data analytics is driven by the demand to process Mac...

Bibliographic details
Published in: International journal of advanced computer science & applications 2021, Vol.12 (4)
Main authors: BENLACHIMI, YASSINE; EL, ABDELAZIZ; LAHCEN, MOULAY
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page
container_issue 4
container_start_page
container_title International journal of advanced computer science & applications
container_volume 12
creator BENLACHIMI, YASSINE
EL, ABDELAZIZ
LAHCEN, MOULAY
description With the Big Data explosion brought on by the Information Technology (IT) revolution of the last few decades, processing and analyzing data at low cost and in minimum time has become immensely challenging. The field of Big Data analytics is driven by the demand to process Machine Learning (ML) data, real-time streaming data, and graphics processing. The most efficient solutions for Big Data analysis in a distributed environment are Hadoop and Spark, both administered by Apache. Both are open-source data management frameworks that distribute and compute large datasets across multiple clusters of computing nodes. This paper provides a comprehensive comparison of Apache Hadoop and Apache Spark in terms of efficiency, scalability, security, cost-effectiveness, and other parameters. It describes the primary components of the Hadoop and Spark frameworks in order to compare their performance. The major conclusion is that Spark is better in terms of scalability and speed for real-time streaming applications, whereas Hadoop is more viable for applications dealing with bigger datasets. This case study evaluates the performance of Hadoop components such as MapReduce and the Hadoop Distributed File System (HDFS) by applying them to the well-known Word Count algorithm to ascertain their efficacy in terms of storage and computational time. It then analyzes how Spark's in-memory processing could reduce the computational time of the Word Count algorithm.
doi_str_mv 10.14569/IJACSA.2021.0120495
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2655119528</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2655119528</sourcerecordid><originalsourceid>FETCH-LOGICAL-c391t-36aaa2660fa87a6a659d2c863add4f27d94792f2c2771cfc0a30caa6a70ad9e03</originalsourceid><addsrcrecordid>eNotkF1LwzAUhoMoOOb-gRcBrzvz0STNZSnOTQSFKXoXDkk7u61NTVpl_9667dycc_Gcl5cHoVtK5jQVUt-vnvJinc8ZYXROKCOpFhdowqiQiRCKXB7vLKFEfV6jWYxbMg7XTGZ8gl5zXPimgwB9_VPivIX9IdYR-wovwXnfYWgdXo_ADi8CNOWvD7uIh1i3G_zhgxvfh7bH-X7jQ91_NTfoqoJ9LGfnPUXvi4e3Ypk8vzyuivw5sVzTPuESAJiUpIJMgQQptGM2kxycSyumnE6VZhWzTClqK0uAEwsjqAg4XRI-RXen3C7476GMvdn6IYz1o2FSCEq1YNlIpSfKBh9jKCvThbqBcDCUmKM-c9Jn_vWZsz7-BwFBYs0</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2655119528</pqid></control><display><type>article</type><title>A Comparative Analysis of Hadoop and Spark Frameworks using Word Count Algorithm</title><source>EZB-FREE-00999 freely available EZB journals</source><creator>BENLACHIMI, YASSINE ; EL, ABDELAZIZ ; LAHCEN, MOULAY</creator><creatorcontrib>BENLACHIMI, YASSINE ; EL, ABDELAZIZ ; LAHCEN, MOULAY</creatorcontrib><description>With the advent of the Big Data explosion due to the Information Technology (IT) revolution during the last few decades, the need for processing and analyzing the data at low cost in minimum time has become immensely challenging. The field of Big Data analytics is driven by the demand to process Machine Learning (ML) data, real-time streaming data, and graphics processing. The most efficient solutions to Big Data analysis in a distributed environment are Hadoop and Spark administered by Apache, both these solutions are open-source data management frameworks and they allow to distribute and compute the large datasets across multiple clusters of computing nodes. 
This paper provides a comprehensive comparison between Apache Hadoop &amp; Apache Spark in terms of efficiency, scalability, security, cost-effectiveness, and other parameters. It describes primary components of Hadoop and Spark frameworks to compare their performance. The major conclusion is that Spark is better in terms of scalability and speed for real-time streaming applications; whereas, Hadoop is more viable for applications dealing with bigger datasets. This case study evaluates the performance of various components of Hadoop-such, MapReduce, and Hadoop Distributed File System (HDFS) by applying it to the well-known Word Count algorithm to ascertain its efficacy in terms of storage and computational time. Subsequently, it also provides an analysis of how Spark’s in-line memory processing could reduce the computational time of the Word Count Algorithm.</description><identifier>ISSN: 2158-107X</identifier><identifier>EISSN: 2156-5570</identifier><identifier>DOI: 10.14569/IJACSA.2021.0120495</identifier><language>eng</language><publisher>West Yorkshire: Science and Information (SAI) Organization Limited</publisher><subject>Algorithms ; Big Data ; Computational efficiency ; Computing time ; Cost analysis ; Data analysis ; Data management ; Datasets ; Information technology ; Machine learning ; Performance evaluation ; Real time</subject><ispartof>International journal of advanced computer science &amp; applications, 2021, Vol.12 (4)</ispartof><rights>2021. This work is licensed under https://creativecommons.org/licenses/by/4.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c391t-36aaa2660fa87a6a659d2c863add4f27d94792f2c2771cfc0a30caa6a70ad9e03</citedby></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,776,780,4009,27902,27903,27904</link.rule.ids></links><search><creatorcontrib>BENLACHIMI, YASSINE</creatorcontrib><creatorcontrib>EL, ABDELAZIZ</creatorcontrib><creatorcontrib>LAHCEN, MOULAY</creatorcontrib><title>A Comparative Analysis of Hadoop and Spark Frameworks using Word Count Algorithm</title><title>International journal of advanced computer science &amp; applications</title><description>With the advent of the Big Data explosion due to the Information Technology (IT) revolution during the last few decades, the need for processing and analyzing the data at low cost in minimum time has become immensely challenging. The field of Big Data analytics is driven by the demand to process Machine Learning (ML) data, real-time streaming data, and graphics processing. The most efficient solutions to Big Data analysis in a distributed environment are Hadoop and Spark administered by Apache, both these solutions are open-source data management frameworks and they allow to distribute and compute the large datasets across multiple clusters of computing nodes. This paper provides a comprehensive comparison between Apache Hadoop &amp; Apache Spark in terms of efficiency, scalability, security, cost-effectiveness, and other parameters. It describes primary components of Hadoop and Spark frameworks to compare their performance. 
The major conclusion is that Spark is better in terms of scalability and speed for real-time streaming applications; whereas, Hadoop is more viable for applications dealing with bigger datasets. This case study evaluates the performance of various components of Hadoop-such, MapReduce, and Hadoop Distributed File System (HDFS) by applying it to the well-known Word Count algorithm to ascertain its efficacy in terms of storage and computational time. Subsequently, it also provides an analysis of how Spark’s in-line memory processing could reduce the computational time of the Word Count Algorithm.</description><subject>Algorithms</subject><subject>Big Data</subject><subject>Computational efficiency</subject><subject>Computing time</subject><subject>Cost analysis</subject><subject>Data analysis</subject><subject>Data management</subject><subject>Datasets</subject><subject>Information technology</subject><subject>Machine learning</subject><subject>Performance evaluation</subject><subject>Real time</subject><issn>2158-107X</issn><issn>2156-5570</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><sourceid>8G5</sourceid><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><sourceid>GUQSH</sourceid><sourceid>M2O</sourceid><recordid>eNotkF1LwzAUhoMoOOb-gRcBrzvz0STNZSnOTQSFKXoXDkk7u61NTVpl_9667dycc_Gcl5cHoVtK5jQVUt-vnvJinc8ZYXROKCOpFhdowqiQiRCKXB7vLKFEfV6jWYxbMg7XTGZ8gl5zXPimgwB9_VPivIX9IdYR-wovwXnfYWgdXo_ADi8CNOWvD7uIh1i3G_zhgxvfh7bH-X7jQ91_NTfoqoJ9LGfnPUXvi4e3Ypk8vzyuivw5sVzTPuESAJiUpIJMgQQptGM2kxycSyumnE6VZhWzTClqK0uAEwsjqAg4XRI-RXen3C7476GMvdn6IYz1o2FSCEq1YNlIpSfKBh9jKCvThbqBcDCUmKM-c9Jn_vWZsz7-BwFBYs0</recordid><startdate>2021</startdate><enddate>2021</enddate><creator>BENLACHIMI, YASSINE</creator><creator>EL, ABDELAZIZ</creator><creator>LAHCEN, 
MOULAY</creator><general>Science and Information (SAI) Organization Limited</general><scope>AAYXX</scope><scope>CITATION</scope><scope>3V.</scope><scope>7XB</scope><scope>8FE</scope><scope>8FG</scope><scope>8FK</scope><scope>8G5</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>GNUQQ</scope><scope>GUQSH</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K7-</scope><scope>M2O</scope><scope>MBDVC</scope><scope>P5Z</scope><scope>P62</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>Q9U</scope></search><sort><creationdate>2021</creationdate><title>A Comparative Analysis of Hadoop and Spark Frameworks using Word Count Algorithm</title><author>BENLACHIMI, YASSINE ; EL, ABDELAZIZ ; LAHCEN, MOULAY</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c391t-36aaa2660fa87a6a659d2c863add4f27d94792f2c2771cfc0a30caa6a70ad9e03</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Algorithms</topic><topic>Big Data</topic><topic>Computational efficiency</topic><topic>Computing time</topic><topic>Cost analysis</topic><topic>Data analysis</topic><topic>Data management</topic><topic>Datasets</topic><topic>Information technology</topic><topic>Machine learning</topic><topic>Performance evaluation</topic><topic>Real time</topic><toplevel>online_resources</toplevel><creatorcontrib>BENLACHIMI, YASSINE</creatorcontrib><creatorcontrib>EL, ABDELAZIZ</creatorcontrib><creatorcontrib>LAHCEN, MOULAY</creatorcontrib><collection>CrossRef</collection><collection>ProQuest Central (Corporate)</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Central (Alumni) 
(purchase pre-March 2016)</collection><collection>Research Library (Alumni Edition)</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies &amp; Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>ProQuest Central Student</collection><collection>Research Library Prep</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>Research Library</collection><collection>Research Library (Corporate)</collection><collection>Advanced Technologies &amp; Aerospace Database</collection><collection>ProQuest Advanced Technologies &amp; Aerospace Collection</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>ProQuest Central Basic</collection><jtitle>International journal of advanced computer science &amp; applications</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>BENLACHIMI, YASSINE</au><au>EL, ABDELAZIZ</au><au>LAHCEN, MOULAY</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>A Comparative Analysis of Hadoop and Spark Frameworks using Word Count Algorithm</atitle><jtitle>International journal of advanced computer science &amp; 
applications</jtitle><date>2021</date><risdate>2021</risdate><volume>12</volume><issue>4</issue><issn>2158-107X</issn><eissn>2156-5570</eissn><abstract>With the advent of the Big Data explosion due to the Information Technology (IT) revolution during the last few decades, the need for processing and analyzing the data at low cost in minimum time has become immensely challenging. The field of Big Data analytics is driven by the demand to process Machine Learning (ML) data, real-time streaming data, and graphics processing. The most efficient solutions to Big Data analysis in a distributed environment are Hadoop and Spark administered by Apache, both these solutions are open-source data management frameworks and they allow to distribute and compute the large datasets across multiple clusters of computing nodes. This paper provides a comprehensive comparison between Apache Hadoop &amp; Apache Spark in terms of efficiency, scalability, security, cost-effectiveness, and other parameters. It describes primary components of Hadoop and Spark frameworks to compare their performance. The major conclusion is that Spark is better in terms of scalability and speed for real-time streaming applications; whereas, Hadoop is more viable for applications dealing with bigger datasets. This case study evaluates the performance of various components of Hadoop-such, MapReduce, and Hadoop Distributed File System (HDFS) by applying it to the well-known Word Count algorithm to ascertain its efficacy in terms of storage and computational time. Subsequently, it also provides an analysis of how Spark’s in-line memory processing could reduce the computational time of the Word Count Algorithm.</abstract><cop>West Yorkshire</cop><pub>Science and Information (SAI) Organization Limited</pub><doi>10.14569/IJACSA.2021.0120495</doi><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 2158-107X
ispartof International journal of advanced computer science & applications, 2021, Vol.12 (4)
issn 2158-107X
2156-5570
language eng
recordid cdi_proquest_journals_2655119528
source EZB-FREE-00999 freely available EZB journals
subjects Algorithms
Big Data
Computational efficiency
Computing time
Cost analysis
Data analysis
Data management
Datasets
Information technology
Machine learning
Performance evaluation
Real time
title A Comparative Analysis of Hadoop and Spark Frameworks using Word Count Algorithm
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-26T16%3A16%3A20IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20Comparative%20Analysis%20of%20Hadoop%20and%20Spark%20Frameworks%20using%20Word%20Count%20Algorithm&rft.jtitle=International%20journal%20of%20advanced%20computer%20science%20&%20applications&rft.au=BENLACHIMI,%20YASSINE&rft.date=2021&rft.volume=12&rft.issue=4&rft.issn=2158-107X&rft.eissn=2156-5570&rft_id=info:doi/10.14569/IJACSA.2021.0120495&rft_dat=%3Cproquest_cross%3E2655119528%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2655119528&rft_id=info:pmid/&rfr_iscdi=true
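
The Word Count computation named in the description can be illustrated with a minimal, single-process sketch of the map, shuffle, and reduce steps. This is not the authors' implementation and uses neither the Hadoop nor the Spark API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group the emitted counts by word."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle_phase(pairs))


# Hypothetical sample input, standing in for lines read from a dataset.
sample_lines = ["big data big clusters", "Big Data analytics"]
counts = word_count(sample_lines)
```

In a distributed framework the same three steps would run across many nodes. The sketch keeps every intermediate result in local memory, so it shows only the logical structure of the algorithm and says nothing about the performance of either framework.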