Dynamic Replication Policy on HDFS Based on Machine Learning Clustering

Data growth in recent years has been swift, leading to the emergence of big data science. Distributed File Systems (DFS) are commonly used to handle big data, like Google File System (GFS), Hadoop Distributed File System (HDFS), and others. The DFS should provide the availability of data and reliabi...

Full description

Bibliographic Details
Published in:	IEEE access 2023, Vol.11, p.18551-18559
Main authors:	Ahmed, Motaz A.; Khafagy, Mohamed H.; Shaheen, Masoud E.; Kaseb, Mostafa R.
Format:	Article
Language:	eng
Subjects:	Availability; Big Data; Clustering; Computer networks; Data science; Distributed computing; Distributed processing; Feature extraction; File systems; Hadoop distributed file system; high-performance distributed computing; Machine learning; reliability; Replicability; Replication; replication policy; Support vector machines
Online access:	Full text

container_end_page	18559
container_issue
container_start_page	18551
container_title	IEEE access
container_volume	11
creator	Ahmed, Motaz A.; Khafagy, Mohamed H.; Shaheen, Masoud E.; Kaseb, Mostafa R.
description	Data growth in recent years has been swift, leading to the emergence of big data science. Distributed File Systems (DFS), such as the Google File System (GFS) and the Hadoop Distributed File System (HDFS), are commonly used to handle big data. A DFS should provide data availability and system reliability in case of failure. To do so, it replicates files in different locations. These replicas consume storage space and other resources. The importance of a file depends on how frequently it is used in the system, so files that are rarely used do not need to be replicated many times. This paper introduces a Dynamic Replication Policy using Machine Learning Clustering (DRPMLC) on HDFS, which uses machine learning to cluster the files into different groups and applies a different replication policy to each group in order to reduce storage consumption, improve read and write operation times, and keep the availability and reliability of HDFS as a High-Performance Distributed Computing (HPDC) system.
doi_str_mv	10.1109/ACCESS.2023.3247190
format	Article
publisher	Piscataway: IEEE
eissn	2169-3536
coden	IAECCG
tpages	9
orcid	0000-0002-2703-5487; 0000-0003-4853-3415; 0000-0001-9135-3271; 0000-0003-0479-0516
rights	Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2023
toplevel	peer_reviewed; online_resources
oa	free_for_read
fulltext	fulltext
identifier	ISSN: 2169-3536
ispartof	IEEE access, 2023, Vol.11, p.18551-18559
issn	2169-3536 2169-3536
language	eng
recordid	cdi_proquest_journals_2780983015
source	DOAJ Directory of Open Access Journals; EZB-FREE-00999 freely available EZB journals; IEEE Xplore Open Access Journals
subjects	Availability; Big Data; Clustering; Computer networks; Data science; Distributed computing; Distributed processing; Feature extraction; File systems; Hadoop distributed file system; high-performance distributed computing; Machine learning; reliability; Replicability; Replication; replication policy; Support vector machines
title	Dynamic Replication Policy on HDFS Based on Machine Learning Clustering
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-28T23%3A25%3A58IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_ieee_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Dynamic%20Replication%20Policy%20on%20HDFS%20Based%20on%20Machine%20Learning%20Clustering&rft.jtitle=IEEE%20access&rft.au=Ahmed,%20Motaz%20A.&rft.date=2023&rft.volume=11&rft.spage=18551&rft.epage=18559&rft.pages=18551-18559&rft.issn=2169-3536&rft.eissn=2169-3536&rft.coden=IAECCG&rft_id=info:doi/10.1109/ACCESS.2023.3247190&rft_dat=%3Cproquest_ieee_%3E2780983015%3C/proquest_ieee_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2780983015&rft_id=info:pmid/&rft_ieee_id=10049393&rft_doaj_id=oai_doaj_org_article_961a1ff489be4c2784250925823a0329&rfr_iscdi=true
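
The description above says that DRPMLC clusters files by how frequently they are used and applies a different replication policy to each group. The record does not state which features, which clustering algorithm, or which replication factors the paper uses, so the following is only a minimal sketch of the general idea under loud assumptions: files are described by a single access count, they are grouped with a plain one-dimensional k-means, and the names `kmeans_1d`, `plan_replication`, and `REPLICATION_BY_RANK` as well as the replication factors 2, 3, and 4 are hypothetical and not taken from the paper.

```python
from typing import Dict, List

# Hypothetical mapping from cluster rank (0 = least accessed group) to a
# replication factor. These numbers are illustrative assumptions only.
REPLICATION_BY_RANK: Dict[int, int] = {0: 2, 1: 3, 2: 4}


def kmeans_1d(values: List[float], k: int, max_iter: int = 100) -> List[int]:
    """Cluster scalar values with plain k-means.

    Returns one label per value; label 0 is the cluster with the smallest
    centroid and label k - 1 the cluster with the largest.
    """
    distinct = sorted(set(values))
    if not 1 <= k <= len(distinct):
        raise ValueError("k must be between 1 and the number of distinct values")
    if k == 1:
        return [0] * len(values)

    # Deterministic start: spread the centroids evenly over the sorted distinct values.
    centroids = [float(distinct[i * (len(distinct) - 1) // (k - 1)]) for i in range(k)]
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)

    # Relabel so that labels follow ascending centroid order.
    order = sorted(range(k), key=lambda c: centroids[c])
    rank = {c: r for r, c in enumerate(order)}
    return [rank[lab] for lab in labels]


def plan_replication(access_counts: Dict[str, int]) -> Dict[str, int]:
    """Map each file path to a replication factor based on its access cluster."""
    paths = list(access_counts)
    labels = kmeans_1d([float(access_counts[p]) for p in paths], len(REPLICATION_BY_RANK))
    return {p: REPLICATION_BY_RANK[lab] for p, lab in zip(paths, labels)}


if __name__ == "__main__":
    # Made-up access counts, purely for illustration.
    sample = {
        "/logs/a": 1, "/logs/b": 2,
        "/data/c": 40, "/data/d": 45,
        "/hot/e": 400, "/hot/f": 420,
    }
    for path, factor in plan_replication(sample).items():
        print(path, factor)
```

The sketch only computes a replication plan. Applying it to a real cluster (for example with the `hdfs dfs -setrep` command) and choosing factors that preserve the availability and reliability goals named in the description are left out.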