Scalable high performance de-duplication backup via hash join

Apart from high space efficiency, other demanding requirements for enterprise de-duplication backup are high performance, high scalability, and availability in large-scale distributed environments. The main challenge is reducing the significant disk input/output (I/O) overhead caused by constantly accessing the disk to identify duplicate chunks. Existing inline de-duplication approaches rely mainly on duplicate locality to avoid the disk bottleneck and therefore degrade under workloads with poor duplicate locality. This paper presents Chunkfarm, a post-processing de-duplication backup system designed to improve capacity, throughput, and scalability. Chunkfarm performs de-duplication using a hash join algorithm, which turns the notoriously random, small disk I/Os of fingerprint lookups and updates into large sequential disk I/Os, achieving high write throughput that is not influenced by workload locality. More importantly, by decentralizing fingerprint lookup and update, Chunkfarm allows a cluster of servers to perform de-duplication backup in parallel; it is therefore well suited to distributed implementation and applicable to large-scale, distributed storage systems.
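
To make the core idea concrete, the following is a minimal sketch (not the authors' Chunkfarm code; the file names, partition count, and 20-byte SHA-1 fingerprints are assumptions) of how batched fingerprint lookup can be done as a Grace-style hash join: new fingerprints and the stored fingerprint index are each partitioned on disk by a hash prefix using sequential writes, and each pair of partitions is then joined in memory, so no per-chunk random index lookup is needed.

```python
# Illustrative sketch only; NOT the Chunkfarm implementation.
# Duplicate detection is done as a partitioned (Grace-style) hash join between
# a batch of new chunk fingerprints and an on-disk fingerprint index, using
# large sequential reads/writes instead of one random lookup per chunk.
import hashlib
import os

NUM_PARTITIONS = 16  # hypothetical; sized so one index partition fits in RAM


def partition_of(fingerprint: bytes) -> int:
    """Assign a fingerprint to a partition by its leading byte."""
    return fingerprint[0] % NUM_PARTITIONS


def write_partitions(fingerprints, prefix):
    """Append fingerprints into per-partition files (sequential I/O, append-only)."""
    files = [open(f"{prefix}.{p}", "ab") for p in range(NUM_PARTITIONS)]
    try:
        for fp in fingerprints:
            files[partition_of(fp)].write(fp)
    finally:
        for f in files:
            f.close()


def read_partition(prefix, p, fp_len=20):
    """Stream one partition file back as fixed-length fingerprints."""
    path = f"{prefix}.{p}"
    if not os.path.exists(path):
        return
    with open(path, "rb") as f:
        while chunk := f.read(fp_len):
            yield chunk


def dedup_batch(new_fingerprints, index_prefix="index", batch_prefix="batch"):
    """Hash-join a batch of new fingerprints against the stored index.

    Returns the fingerprints in the batch that duplicate already-stored chunks.
    Both sides are partitioned on disk first; each partition pair is then
    joined in memory (index partition = build side, batch partition = probe side).
    """
    write_partitions(new_fingerprints, batch_prefix)
    duplicates = set()
    for p in range(NUM_PARTITIONS):
        stored = set(read_partition(index_prefix, p))
        for fp in read_partition(batch_prefix, p):
            if fp in stored:
                duplicates.add(fp)
    return duplicates


if __name__ == "__main__":
    # Fingerprints would normally be SHA-1 digests of variable-size chunks.
    fps = [hashlib.sha1(f"chunk-{i}".encode()).digest() for i in range(1000)]
    write_partitions(fps, "index")       # pretend these chunks are already stored
    print(len(dedup_batch(fps[:100])))   # all 100 should be reported as duplicates
```

Because the partitions are independent, they could in principle be assigned to different servers and joined in parallel, which is the intuition behind the decentralized, cluster-wide de-duplication the abstract describes.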

Bibliographic details

Published in: Frontiers of information technology & electronic engineering, 2010-05, Vol. 11 (5), p. 315-327
Authors: Yang, Tian-ming; Feng, Dan; Niu, Zhong-ying; Wan, Ya-ping
Format: Article
Language: English
Subjects: Algorithms; Back up systems; Communications Engineering; Computer Hardware; Computer Science; Computer Systems Organization and Communication Networks; Electrical Engineering; Electronics and Microelectronics; Fingerprints; Instrumentation; Networks; Storage systems; Workload; Workloads
Online access: Full text
DOI: 10.1631/jzus.C0910445
ISSN: 1869-1951; 2095-9184
EISSN: 1869-196X; 2095-9230
Publisher: SP Zhejiang University Press, Heidelberg