Scalable high performance de-duplication backup via hash join

Apart from high space efficiency, other demanding requirements for enterprise de-duplication backup are high performance, high scalability, and availability in large-scale distributed environments. The main challenge is reducing the significant disk input/output (I/O) overhead caused by constantly accessing the disk to identify duplicate chunks. Existing inline de-duplication approaches rely mainly on duplicate locality to avoid the disk bottleneck and therefore degrade under workloads with poor duplicate locality. This paper presents Chunkfarm, a post-processing de-duplication backup system designed to improve capacity, throughput, and scalability. Chunkfarm performs de-duplication using a hash join algorithm, which turns the notoriously random, small disk I/Os of fingerprint lookups and updates into large sequential disk I/Os, achieving high write throughput that is not influenced by workload locality. More importantly, by decentralizing fingerprint lookup and update, Chunkfarm allows a cluster of servers to perform de-duplication backup in parallel; it is therefore well suited to distributed implementation and applicable to large-scale, distributed storage systems.
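
To make the core idea concrete, the following is a minimal sketch (not the authors' Chunkfarm code; the file names, partition count, and 20-byte SHA-1 fingerprints are assumptions) of how batched fingerprint lookup can be done as a Grace-style hash join: new fingerprints and the stored fingerprint index are each partitioned on disk by a hash prefix using sequential writes, and each pair of partitions is then joined in memory, so no per-chunk random index lookup is needed.

```python
# Illustrative sketch only; NOT the Chunkfarm implementation.
# Duplicate detection is done as a partitioned (Grace-style) hash join between
# a batch of new chunk fingerprints and an on-disk fingerprint index, using
# large sequential reads/writes instead of one random lookup per chunk.
import hashlib
import os

NUM_PARTITIONS = 16  # hypothetical; sized so one index partition fits in RAM


def partition_of(fingerprint: bytes) -> int:
    """Assign a fingerprint to a partition by its leading byte."""
    return fingerprint[0] % NUM_PARTITIONS


def write_partitions(fingerprints, prefix):
    """Append fingerprints into per-partition files (sequential I/O, append-only)."""
    files = [open(f"{prefix}.{p}", "ab") for p in range(NUM_PARTITIONS)]
    try:
        for fp in fingerprints:
            files[partition_of(fp)].write(fp)
    finally:
        for f in files:
            f.close()


def read_partition(prefix, p, fp_len=20):
    """Stream one partition file back as fixed-length fingerprints."""
    path = f"{prefix}.{p}"
    if not os.path.exists(path):
        return
    with open(path, "rb") as f:
        while chunk := f.read(fp_len):
            yield chunk


def dedup_batch(new_fingerprints, index_prefix="index", batch_prefix="batch"):
    """Hash-join a batch of new fingerprints against the stored index.

    Returns the fingerprints in the batch that duplicate already-stored chunks.
    Both sides are partitioned on disk first; each partition pair is then
    joined in memory (index partition = build side, batch partition = probe side).
    """
    write_partitions(new_fingerprints, batch_prefix)
    duplicates = set()
    for p in range(NUM_PARTITIONS):
        stored = set(read_partition(index_prefix, p))
        for fp in read_partition(batch_prefix, p):
            if fp in stored:
                duplicates.add(fp)
    return duplicates


if __name__ == "__main__":
    # Fingerprints would normally be SHA-1 digests of variable-size chunks.
    fps = [hashlib.sha1(f"chunk-{i}".encode()).digest() for i in range(1000)]
    write_partitions(fps, "index")       # pretend these chunks are already stored
    print(len(dedup_batch(fps[:100])))   # all 100 should be reported as duplicates
```

Because the partitions are independent, they could in principle be assigned to different servers and joined in parallel, which is the intuition behind the decentralized, cluster-wide de-duplication the abstract describes.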

Bibliographic details

Published in: Frontiers of information technology & electronic engineering, 2010-05, Vol. 11 (5), p. 315-327
Authors: Yang, Tian-ming; Feng, Dan; Niu, Zhong-ying; Wan, Ya-ping
Format: Article
Language: English
Subjects: Algorithms; Back up systems; Communications Engineering; Computer Hardware; Computer Science; Computer Systems Organization and Communication Networks; Electrical Engineering; Electronics and Microelectronics; Fingerprints; Instrumentation; Networks; Storage systems; Workload; Workloads
Online access: Full text
DOI: 10.1631/jzus.C0910445
ISSN: 1869-1951; 2095-9184
EISSN: 1869-196X; 2095-9230
Publisher: SP Zhejiang University Press, Heidelberg