Scalable high performance de-duplication backup via hash join
Apart from high space efficiency, other demanding requirements for enterprise de-duplication backup are high performance, high scalability, and availability for large-scale distributed environments. The main challenge is reducing the significant disk input/output (I/O) overhead that results from constantly accessing the disk to identify duplicate chunks. Existing inline de-duplication approaches mainly rely on duplicate locality to avoid the disk bottleneck, and thus suffer from degradation under workloads with poor duplicate locality. This paper presents Chunkfarm, a post-processing de-duplication backup system designed to improve capacity, throughput, and scalability for de-duplication. Chunkfarm performs de-duplication backup using the hash join algorithm, which turns the notoriously random and small disk I/Os of fingerprint lookups and updates into large sequential disk I/Os, hence achieving high write throughput that is not influenced by workload locality. More importantly, by decentralizing fingerprint lookup and update, Chunkfarm supports a cluster of servers performing de-duplication backup in parallel; it is hence conducive to distributed implementation and thus applicable to large-scale and distributed storage systems.
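The abstract describes replacing per-chunk fingerprint lookups with a batched, hash-join-style pass. The sketch below is an illustration of that general idea only, not the authors' Chunkfarm implementation; the names `NUM_PARTITIONS`, `partition_by_hash`, and `dedup_join` are invented for this example. New fingerprints and the stored fingerprint index are each partitioned by a hash prefix, then matched one in-memory partition at a time, so random index probes become sequential scans over partitions.

```python
# Minimal sketch of hash-join-based duplicate detection (illustrative only,
# not the Chunkfarm code). In a real post-processing system the partitions
# would be files read and written sequentially; dictionaries stand in here.
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 16  # assumption: one partition of the index fits in memory

def fingerprint(chunk: bytes) -> str:
    """Content fingerprint of a chunk (SHA-1 used here for illustration)."""
    return hashlib.sha1(chunk).hexdigest()

def partition_by_hash(fingerprints):
    """Bucket fingerprints by a hash prefix so each bucket can be scanned sequentially."""
    parts = defaultdict(list)
    for fp in fingerprints:
        parts[int(fp[:2], 16) % NUM_PARTITIONS].append(fp)
    return parts

def dedup_join(new_fps, index_fps):
    """Join new fingerprints against the stored index, one partition pair at a time."""
    new_parts, index_parts = partition_by_hash(new_fps), partition_by_hash(index_fps)
    duplicates, unique = [], []
    for p in range(NUM_PARTITIONS):
        known = set(index_parts.get(p, []))      # build side, held in memory
        for fp in new_parts.get(p, []):          # probe side
            (duplicates if fp in known else unique).append(fp)
    return duplicates, unique
```

Because each partition pair is processed independently, the same join can be spread across several servers, which is the decentralization property the abstract attributes to Chunkfarm.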
Saved in:
Published in: | Frontiers of information technology & electronic engineering 2010-05, Vol.11 (5), p.315-327 |
---|---|
Main authors: | Yang, Tian-ming; Feng, Dan; Niu, Zhong-ying; Wan, Ya-ping |
Format: | Article |
Language: | eng |
Subjects: | Algorithms; Back up systems; Communications Engineering; Computer Hardware; Computer Science; Computer Systems Organization and Communication Networks; Electrical Engineering; Electronics and Microelectronics; Fingerprints; Instrumentation; Networks; Storage systems; Workload; Workloads |
Online access: | Full text |
container_end_page | 327 |
---|---|
container_issue | 5 |
container_start_page | 315 |
container_title | Frontiers of information technology & electronic engineering |
container_volume | 11 |
creator | Yang, Tian-ming; Feng, Dan; Niu, Zhong-ying; Wan, Ya-ping |
description | Apart from high space efficiency, other demanding requirements for enterprise de-duplication backup are high performance, high scalability, and availability for large-scale distributed environments. The main challenge is reducing the significant disk input/output (I/O) overhead as a result of constantly accessing the disk to identify duplicate chunks. Existing inline de-duplication approaches mainly rely on duplicate locality to avoid disk bottleneck, thus suffering from degradation under poor duplicate locality workload. This paper presents Chunkfarm, a post-processing de-duplication backup system designed to improve capacity, throughput, and scalability for de-duplication. Chunkfarm performs de-duplication backup using the hash join algorithm, which turns the notoriously random and small disk I/Os of fingerprint lookups and updates into large sequential disk I/Os, hence achieving high write throughput not influenced by workload locality. More importantly, by decentralizing fingerprint lookup and update, Chunkfarm supports a cluster of servers to perform de-duplication backup in parallel; it hence is conducive to distributed implementation and thus applicable to large-scale and distributed storage systems. |
doi_str_mv | 10.1631/jzus.C0910445 |
format | Article |
fulltext | fulltext |
identifier | ISSN: 1869-1951 |
ispartof | Frontiers of information technology & electronic engineering, 2010-05, Vol.11 (5), p.315-327 |
issn | 1869-1951; 2095-9184; 1869-196X; 2095-9230 |
language | eng |
recordid | cdi_proquest_journals_2918723220 |
source | SpringerLink Journals; Alma/SFX Local Collection; ProQuest Central |
subjects | Algorithms; Back up systems; Communications Engineering; Computer Hardware; Computer Science; Computer Systems Organization and Communication Networks; Electrical Engineering; Electronics and Microelectronics; Fingerprints; Instrumentation; Networks; Storage systems; Workload; Workloads |
title | Scalable high performance de-duplication backup via hash join |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-19T09%3A12%3A23IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Scalable%20high%20performance%20de-duplication%20backup%20via%20hash%20join&rft.jtitle=Frontiers%20of%20information%20technology%20&%20electronic%20engineering&rft.au=Yang,%20Tian-ming&rft.date=2010-05-01&rft.volume=11&rft.issue=5&rft.spage=315&rft.epage=327&rft.pages=315-327&rft.issn=1869-1951&rft.eissn=1869-196X&rft_id=info:doi/10.1631/jzus.C0910445&rft_dat=%3Cproquest_cross%3E2918723220%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2918723220&rft_id=info:pmid/&rft_cqvip_id=33797338&rfr_iscdi=true |