Dynamic Replication Policy on HDFS Based on Machine Learning Clustering

Data has grown rapidly in recent years, giving rise to big data science. Distributed File Systems (DFS), such as the Google File System (GFS) and the Hadoop Distributed File System (HDFS), are commonly used to handle big data. The DFS should provide the availability of data and reliabi...

Full description

Bibliographic details
Published in: IEEE access 2023, Vol.11, p.18551-18559
Main authors: Ahmed, Motaz A., Khafagy, Mohamed H., Shaheen, Masoud E., Kaseb, Mostafa R.
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 18559
container_issue
container_start_page 18551
container_title IEEE access
container_volume 11
creator Ahmed, Motaz A.
Khafagy, Mohamed H.
Shaheen, Masoud E.
Kaseb, Mostafa R.
description Data has grown rapidly in recent years, giving rise to big data science. Distributed File Systems (DFS), such as the Google File System (GFS) and the Hadoop Distributed File System (HDFS), are commonly used to handle big data. A DFS should keep data available and the system reliable in case of failure, and it does so by replicating files in different locations. These replicas consume storage space and other resources. The importance of a file depends on how frequently it is used in the system, so some files do not deserve many replicas because they are unimportant to the system. This paper introduces a Dynamic Replication Policy using Machine Learning Clustering (DRPMLC) on HDFS, which uses machine learning to cluster the files into groups and applies a different replication policy to each group in order to reduce storage consumption, improve read and write operation times, and maintain the availability and reliability of HDFS as a High-Performance Distributed Computing (HPDC) system.
doi_str_mv 10.1109/ACCESS.2023.3247190
format Article
fullrecord <record><control><sourceid>proquest_ieee_</sourceid><recordid>TN_cdi_proquest_journals_2780983015</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>10049393</ieee_id><doaj_id>oai_doaj_org_article_961a1ff489be4c2784250925823a0329</doaj_id><sourcerecordid>2780983015</sourcerecordid><originalsourceid>FETCH-LOGICAL-c409t-bd7332f9a6cd82fecfa87bbbeda2119e17c761297a07813954834dadba43f5b83</originalsourceid><addsrcrecordid>eNpNUctOwzAQjBBIVKVfAIdInFtsrxPbx5I-pSIQhbO1cZziqk2Kkx7697ikQuxlZ1czsytNFN1TMqKUqKdxlk3X6xEjDEbAuKCKXEU9RlM1hATS63_4Nho0zZaEkmGViF40n5wq3DsTv9vDzhlsXV3Fb3WApzigxWS2jp-xscV5ekHz5Sobryz6ylWbONsdm9b6AO-imxJ3jR1cej_6nE0_ssVw9TpfZuPV0HCi2mFeCABWKkxNIVlpTYlS5HluC2SUKkuFESllSiARkoJKuAReYJEjhzLJJfSjZedb1LjVB-_26E-6Rqd_F7XfaPStMzurVUqRliWXKrfcMCE5S4hiiWSABJgKXo-d18HX30fbtHpbH30V3teBTZQEQpPAgo5lfN003pZ_VynR5wB0F4A-B6AvAQTVQ6dy1tp_CsIVKIAfPcx_WQ</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2780983015</pqid></control><display><type>article</type><title>Dynamic Replication Policy on HDFS Based on Machine Learning Clustering</title><source>DOAJ Directory of Open Access Journals</source><source>EZB-FREE-00999 freely available EZB journals</source><source>IEEE Xplore Open Access Journals</source><creator>Ahmed, Motaz A. ; Khafagy, Mohamed H. ; Shaheen, Masoud E. ; Kaseb, Mostafa R.</creator><creatorcontrib>Ahmed, Motaz A. ; Khafagy, Mohamed H. ; Shaheen, Masoud E. ; Kaseb, Mostafa R.</creatorcontrib><description>Data growth in recent years has been swift, leading to the emergence of big data science. Distributed File Systems (DFS) are commonly used to handle big data, like Google File System (GFS), Hadoop Distributed File System (HDFS), and others. The DFS should provide the availability of data and reliability of the system in case of failure. The DFS replicates the files in different locations to provide availability and reliability. 
These replications consume storage space and other resources. The importance of these files differs depending on how frequently they are used in the system. So some of these files do not deserve to replicate many times because it is unimportant in the system. This paper introduces a Dynamic Replication Policy using Machine Learning Clustering (DRPMLC) on HDFS, which uses Machine Learning to cluster the files into different groups and apply other replication policies to each group to reduce the storage consumption, improve the read and write operations time and keep the availability and reliability of HDFS as a High-Performance Distributed Computing (HPDC).</description><identifier>ISSN: 2169-3536</identifier><identifier>EISSN: 2169-3536</identifier><identifier>DOI: 10.1109/ACCESS.2023.3247190</identifier><identifier>CODEN: IAECCG</identifier><language>eng</language><publisher>Piscataway: IEEE</publisher><subject>Availability ; Big Data ; Clustering ; Computer networks ; Data science ; Distributed computing ; Distributed processing ; Feature extraction ; File systems ; Hadoop distributed file system ; high-performance distributed computing ; Machine learning ; reliability ; Replicability ; Replication ; replication policy ; Support vector machines</subject><ispartof>IEEE access, 2023, Vol.11, p.18551-18559</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) 2023</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c409t-bd7332f9a6cd82fecfa87bbbeda2119e17c761297a07813954834dadba43f5b83</citedby><cites>FETCH-LOGICAL-c409t-bd7332f9a6cd82fecfa87bbbeda2119e17c761297a07813954834dadba43f5b83</cites><orcidid>0000-0002-2703-5487 ; 0000-0003-4853-3415 ; 0000-0001-9135-3271 ; 0000-0003-0479-0516</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/10049393$$EHTML$$P50$$Gieee$$Hfree_for_read</linktohtml><link.rule.ids>314,776,780,860,2096,4010,27610,27900,27901,27902,54908</link.rule.ids></links><search><creatorcontrib>Ahmed, Motaz A.</creatorcontrib><creatorcontrib>Khafagy, Mohamed H.</creatorcontrib><creatorcontrib>Shaheen, Masoud E.</creatorcontrib><creatorcontrib>Kaseb, Mostafa R.</creatorcontrib><title>Dynamic Replication Policy on HDFS Based on Machine Learning Clustering</title><title>IEEE access</title><addtitle>Access</addtitle><description>Data growth in recent years has been swift, leading to the emergence of big data science. Distributed File Systems (DFS) are commonly used to handle big data, like Google File System (GFS), Hadoop Distributed File System (HDFS), and others. The DFS should provide the availability of data and reliability of the system in case of failure. The DFS replicates the files in different locations to provide availability and reliability. These replications consume storage space and other resources. The importance of these files differs depending on how frequently they are used in the system. So some of these files do not deserve to replicate many times because it is unimportant in the system. 
This paper introduces a Dynamic Replication Policy using Machine Learning Clustering (DRPMLC) on HDFS, which uses Machine Learning to cluster the files into different groups and apply other replication policies to each group to reduce the storage consumption, improve the read and write operations time and keep the availability and reliability of HDFS as a High-Performance Distributed Computing (HPDC).</description><subject>Availability</subject><subject>Big Data</subject><subject>Clustering</subject><subject>Computer networks</subject><subject>Data science</subject><subject>Distributed computing</subject><subject>Distributed processing</subject><subject>Feature extraction</subject><subject>File systems</subject><subject>Hadoop distributed file system</subject><subject>high-performance distributed computing</subject><subject>Machine learning</subject><subject>reliability</subject><subject>Replicability</subject><subject>Replication</subject><subject>replication policy</subject><subject>Support vector machines</subject><issn>2169-3536</issn><issn>2169-3536</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>ESBDL</sourceid><sourceid>RIE</sourceid><sourceid>DOA</sourceid><recordid>eNpNUctOwzAQjBBIVKVfAIdInFtsrxPbx5I-pSIQhbO1cZziqk2Kkx7697ikQuxlZ1czsytNFN1TMqKUqKdxlk3X6xEjDEbAuKCKXEU9RlM1hATS63_4Nho0zZaEkmGViF40n5wq3DsTv9vDzhlsXV3Fb3WApzigxWS2jp-xscV5ekHz5Sobryz6ylWbONsdm9b6AO-imxJ3jR1cej_6nE0_ssVw9TpfZuPV0HCi2mFeCABWKkxNIVlpTYlS5HluC2SUKkuFESllSiARkoJKuAReYJEjhzLJJfSjZedb1LjVB-_26E-6Rqd_F7XfaPStMzurVUqRliWXKrfcMCE5S4hiiWSABJgKXo-d18HX30fbtHpbH30V3teBTZQEQpPAgo5lfN003pZ_VynR5wB0F4A-B6AvAQTVQ6dy1tp_CsIVKIAfPcx_WQ</recordid><startdate>2023</startdate><enddate>2023</enddate><creator>Ahmed, Motaz A.</creator><creator>Khafagy, Mohamed H.</creator><creator>Shaheen, Masoud E.</creator><creator>Kaseb, Mostafa R.</creator><general>IEEE</general><general>The Institute of Electrical and Electronics 
Engineers, Inc. (IEEE)</general><scope>97E</scope><scope>ESBDL</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>7SR</scope><scope>8BQ</scope><scope>8FD</scope><scope>JG9</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0002-2703-5487</orcidid><orcidid>https://orcid.org/0000-0003-4853-3415</orcidid><orcidid>https://orcid.org/0000-0001-9135-3271</orcidid><orcidid>https://orcid.org/0000-0003-0479-0516</orcidid></search><sort><creationdate>2023</creationdate><title>Dynamic Replication Policy on HDFS Based on Machine Learning Clustering</title><author>Ahmed, Motaz A. ; Khafagy, Mohamed H. ; Shaheen, Masoud E. ; Kaseb, Mostafa R.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c409t-bd7332f9a6cd82fecfa87bbbeda2119e17c761297a07813954834dadba43f5b83</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Availability</topic><topic>Big Data</topic><topic>Clustering</topic><topic>Computer networks</topic><topic>Data science</topic><topic>Distributed computing</topic><topic>Distributed processing</topic><topic>Feature extraction</topic><topic>File systems</topic><topic>Hadoop distributed file system</topic><topic>high-performance distributed computing</topic><topic>Machine learning</topic><topic>reliability</topic><topic>Replicability</topic><topic>Replication</topic><topic>replication policy</topic><topic>Support vector machines</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Ahmed, Motaz A.</creatorcontrib><creatorcontrib>Khafagy, Mohamed H.</creatorcontrib><creatorcontrib>Shaheen, Masoud E.</creatorcontrib><creatorcontrib>Kaseb, Mostafa R.</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE Xplore Open 
Access Journals</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998–Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics &amp; Communications Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>DOAJ Directory of Open Access Journals</collection><jtitle>IEEE access</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Ahmed, Motaz A.</au><au>Khafagy, Mohamed H.</au><au>Shaheen, Masoud E.</au><au>Kaseb, Mostafa R.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Dynamic Replication Policy on HDFS Based on Machine Learning Clustering</atitle><jtitle>IEEE access</jtitle><stitle>Access</stitle><date>2023</date><risdate>2023</risdate><volume>11</volume><spage>18551</spage><epage>18559</epage><pages>18551-18559</pages><issn>2169-3536</issn><eissn>2169-3536</eissn><coden>IAECCG</coden><abstract>Data growth in recent years has been swift, leading to the emergence of big data science. Distributed File Systems (DFS) are commonly used to handle big data, like Google File System (GFS), Hadoop Distributed File System (HDFS), and others. The DFS should provide the availability of data and reliability of the system in case of failure. The DFS replicates the files in different locations to provide availability and reliability. 
These replications consume storage space and other resources. The importance of these files differs depending on how frequently they are used in the system. So some of these files do not deserve to replicate many times because it is unimportant in the system. This paper introduces a Dynamic Replication Policy using Machine Learning Clustering (DRPMLC) on HDFS, which uses Machine Learning to cluster the files into different groups and apply other replication policies to each group to reduce the storage consumption, improve the read and write operations time and keep the availability and reliability of HDFS as a High-Performance Distributed Computing (HPDC).</abstract><cop>Piscataway</cop><pub>IEEE</pub><doi>10.1109/ACCESS.2023.3247190</doi><tpages>9</tpages><orcidid>https://orcid.org/0000-0002-2703-5487</orcidid><orcidid>https://orcid.org/0000-0003-4853-3415</orcidid><orcidid>https://orcid.org/0000-0001-9135-3271</orcidid><orcidid>https://orcid.org/0000-0003-0479-0516</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 2169-3536
ispartof IEEE access, 2023, Vol.11, p.18551-18559
issn 2169-3536
2169-3536
language eng
recordid cdi_proquest_journals_2780983015
source DOAJ Directory of Open Access Journals; EZB-FREE-00999 freely available EZB journals; IEEE Xplore Open Access Journals
subjects Availability
Big Data
Clustering
Computer networks
Data science
Distributed computing
Distributed processing
Feature extraction
File systems
Hadoop distributed file system
high-performance distributed computing
Machine learning
reliability
Replicability
Replication
replication policy
Support vector machines
title Dynamic Replication Policy on HDFS Based on Machine Learning Clustering
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-28T23%3A25%3A58IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_ieee_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Dynamic%20Replication%20Policy%20on%20HDFS%20Based%20on%20Machine%20Learning%20Clustering&rft.jtitle=IEEE%20access&rft.au=Ahmed,%20Motaz%20A.&rft.date=2023&rft.volume=11&rft.spage=18551&rft.epage=18559&rft.pages=18551-18559&rft.issn=2169-3536&rft.eissn=2169-3536&rft.coden=IAECCG&rft_id=info:doi/10.1109/ACCESS.2023.3247190&rft_dat=%3Cproquest_ieee_%3E2780983015%3C/proquest_ieee_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2780983015&rft_id=info:pmid/&rft_ieee_id=10049393&rft_doaj_id=oai_doaj_org_article_961a1ff489be4c2784250925823a0329&rfr_iscdi=true