Dynamic Replication Policy on HDFS Based on Machine Learning Clustering
Data has grown rapidly in recent years, giving rise to big data science. Distributed File Systems (DFS), such as the Google File System (GFS) and the Hadoop Distributed File System (HDFS), are commonly used to handle big data. The DFS should provide the availability of data and reliabi...
Published in: | IEEE access 2023, Vol.11, p.18551-18559 |
---|---|
Main authors: | Ahmed, Motaz A.; Khafagy, Mohamed H.; Shaheen, Masoud E.; Kaseb, Mostafa R. |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 18559 |
---|---|
container_issue | |
container_start_page | 18551 |
container_title | IEEE access |
container_volume | 11 |
creator | Ahmed, Motaz A.; Khafagy, Mohamed H.; Shaheen, Masoud E.; Kaseb, Mostafa R. |
description | Data has grown rapidly in recent years, giving rise to big data science. Distributed File Systems (DFS), such as the Google File System (GFS) and the Hadoop Distributed File System (HDFS), are commonly used to handle big data. A DFS should keep data available and the system reliable in case of failure, so it replicates files in different locations. These replicas consume storage space and other resources. The importance of a file depends on how frequently it is used in the system, so rarely used files do not warrant many replicas. This paper introduces a Dynamic Replication Policy using Machine Learning Clustering (DRPMLC) on HDFS, which uses machine learning to cluster files into groups and applies a different replication policy to each group in order to reduce storage consumption, improve read and write times, and preserve the availability and reliability of HDFS as a High-Performance Distributed Computing (HPDC) system. |
doi_str_mv | 10.1109/ACCESS.2023.3247190 |
format | Article |
fulltext | fulltext |
identifier | ISSN: 2169-3536 |
ispartof | IEEE access, 2023, Vol.11, p.18551-18559 |
issn | 2169-3536 2169-3536 |
language | eng |
recordid | cdi_proquest_journals_2780983015 |
source | DOAJ Directory of Open Access Journals; EZB-FREE-00999 freely available EZB journals; IEEE Xplore Open Access Journals |
subjects | Availability; Big Data; Clustering; Computer networks; Data science; Distributed computing; Distributed processing; Feature extraction; File systems; Hadoop distributed file system; high-performance distributed computing; Machine learning; reliability; Replicability; Replication; replication policy; Support vector machines |
title | Dynamic Replication Policy on HDFS Based on Machine Learning Clustering |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-28T23%3A25%3A58IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_ieee_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Dynamic%20Replication%20Policy%20on%20HDFS%20Based%20on%20Machine%20Learning%20Clustering&rft.jtitle=IEEE%20access&rft.au=Ahmed,%20Motaz%20A.&rft.date=2023&rft.volume=11&rft.spage=18551&rft.epage=18559&rft.pages=18551-18559&rft.issn=2169-3536&rft.eissn=2169-3536&rft.coden=IAECCG&rft_id=info:doi/10.1109/ACCESS.2023.3247190&rft_dat=%3Cproquest_ieee_%3E2780983015%3C/proquest_ieee_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2780983015&rft_id=info:pmid/&rft_ieee_id=10049393&rft_doaj_id=oai_doaj_org_article_961a1ff489be4c2784250925823a0329&rfr_iscdi=true |
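
The abstract describes DRPMLC only in outline: files are clustered by how they are used, and each cluster receives its own replication policy. The Python sketch below illustrates that general idea and is not the authors' implementation. It rests on several assumptions that do not come from the record: access count is the only feature, the clustering is plain one-dimensional k-means, and the names `kmeans_1d`, `plan_replication` and the `REPLICATION_BY_RANK` mapping (including its replication factors) are hypothetical.

```python
from typing import Dict, List


def kmeans_1d(values: List[float], k: int, iterations: int = 50) -> List[int]:
    """Cluster scalar values with plain Lloyd's k-means.

    Returns one label per value; label 0 is the cluster with the
    smallest centroid, label k-1 the cluster with the largest.
    """
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of values")
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids over the sorted data.
    if k == 1:
        centroids = [ordered[0]]
    else:
        centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Relabel so that cluster ids follow ascending centroid order.
    rank = {c: r for r, c in enumerate(sorted(range(k), key=lambda c: centroids[c]))}
    return [rank[lab] for lab in labels]


# Hypothetical policy table: cluster rank (0 = least accessed) -> replication factor.
# These numbers are illustrative only and are not taken from the paper.
REPLICATION_BY_RANK = {0: 2, 1: 3, 2: 4}


def plan_replication(access_counts: Dict[str, int]) -> Dict[str, int]:
    """Map each file path to a replication factor based on its access-count cluster."""
    paths = list(access_counts)
    labels = kmeans_1d([float(access_counts[p]) for p in paths], k=len(REPLICATION_BY_RANK))
    return {p: REPLICATION_BY_RANK[lab] for p, lab in zip(paths, labels)}


if __name__ == "__main__":
    # Made-up access counts for six hypothetical HDFS paths.
    demo = {"/logs/a": 1, "/logs/b": 2, "/data/c": 40,
            "/data/d": 45, "/hot/e": 400, "/hot/f": 420}
    for path, factor in plan_replication(demo).items():
        # Emit the HDFS shell command that would apply the planned factor.
        print(f"hdfs dfs -setrep {factor} {path}")
```

The sketch only prints `hdfs dfs -setrep` commands instead of running them, so the plan can be inspected before anything changes on a cluster. A real policy would likely use more features than a single access count and would need a lower bound on the replication factor to protect availability.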