The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems

Content-Defined Chunking (CDC) has been playing a key role in data deduplication systems recently due to its high redundancy detection ability. However, existing CDC-based approaches introduce heavy CPU overhead because they declare the chunk cut-points by computing and judging the rolling hashes of...

Full description

Bibliographic details
Published in: IEEE transactions on parallel and distributed systems 2020-09, Vol.31 (9), p.2017-2031
Main authors: Xia, Wen; Zou, Xiangyu; Jiang, Hong; Zhou, Yukun; Liu, Chuanyi; Feng, Dan; Hua, Yu; Hu, Yuchong; Zhang, Yucheng
Format: Article
Language: eng
container_end_page 2031
container_issue 9
container_start_page 2017
container_title IEEE transactions on parallel and distributed systems
container_volume 31
creator Xia, Wen
Zou, Xiangyu
Jiang, Hong
Zhou, Yukun
Liu, Chuanyi
Feng, Dan
Hua, Yu
Hu, Yuchong
Zhang, Yucheng
description Content-Defined Chunking (CDC) has been playing a key role in data deduplication systems recently due to its high redundancy detection ability. However, existing CDC-based approaches introduce heavy CPU overhead because they declare the chunk cut-points by computing and judging the rolling hashes of the data stream byte by byte. In this article, we propose FastCDC, a Fast and efficient Content-Defined Chunking approach, for data deduplication-based storage systems. The key idea behind FastCDC is the combined use of five key techniques, namely, gear based fast rolling hash, simplifying and enhancing the Gear hash judgment, skipping sub-minimum chunk cut-points, normalizing the chunk-size distribution in a small specified region to address the problem of the decreased deduplication ratio stemming from the cut-point skipping, and last but not least, rolling two bytes each time to further speed up CDC. Our evaluation results show that, by using a combination of the five techniques, FastCDC is 3-12X faster than the state-of-the-art CDC approaches, while achieving nearly the same and even higher deduplication ratio as the classic Rabin-based CDC. In addition, our study on the deduplication throughput of FastCDC-based Destor (an open source deduplication project) indicates that FastCDC helps achieve 1.2-3.0X higher throughput than Destor based on state-of-the-art chunkers.
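The first four techniques named in the description can be illustrated with a minimal sketch. It is not the authors' implementation: the size limits, the two masks, the seeded random Gear table, and the function names `cut_point` and `chunks` are all assumptions chosen for illustration, the masks simply test low-order bits, and the fifth technique (rolling two bytes each time) is omitted.

```python
import random

# Hypothetical parameters for illustration; not the values used in the paper.
MIN_SIZE = 2 * 1024    # cut-points before this offset are never considered
AVG_SIZE = 8 * 1024    # the mask switches at this offset (normalization)
MAX_SIZE = 64 * 1024   # a cut is forced at this offset
MASK_HARD = (1 << 15) - 1  # more bits must be zero: cuts are rarer before AVG_SIZE
MASK_EASY = (1 << 11) - 1  # fewer bits must be zero: cuts are likelier after AVG_SIZE
WORD = 0xFFFFFFFFFFFFFFFF  # keep the fingerprint within 64 bits

# Assumed Gear table: one fixed pseudo-random 64-bit value per byte value.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def cut_point(data: bytes, start: int = 0) -> int:
    """Return the exclusive end offset of the chunk that begins at `start`."""
    remaining = len(data) - start
    if remaining <= MIN_SIZE:
        return len(data)
    end = start + min(remaining, MAX_SIZE)
    normal = start + min(remaining, AVG_SIZE)
    fp = 0
    i = start + MIN_SIZE  # skip the sub-minimum region entirely
    while i < end:
        # Gear rolling hash: one shift, one table lookup, one addition per byte.
        fp = ((fp << 1) + GEAR[data[i]]) & WORD
        # Normalized chunking: strict mask first, looser mask past AVG_SIZE.
        mask = MASK_HARD if i < normal else MASK_EASY
        if fp & mask == 0:
            return i + 1
        i += 1
    return end  # no content-defined cut found: cut at MAX_SIZE or end of data


def chunks(data: bytes):
    """Yield consecutive chunks whose concatenation equals `data`."""
    pos = 0
    while pos < len(data):
        nxt = cut_point(data, pos)
        yield data[pos:nxt]
        pos = nxt
```

Because each boundary depends only on the bytes hashed since the previous cut, the same input always yields the same chunks. In a deduplication system each chunk would then be fingerprinted and looked up in an index, a step this sketch leaves out.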
doi_str_mv 10.1109/TPDS.2020.2984632
format Article
fullrecord <record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_proquest_journals_2391257363</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>9055082</ieee_id><sourcerecordid>2391257363</sourcerecordid><originalsourceid>FETCH-LOGICAL-c293t-16047bb5c9ead28035ce05fd4f7751b53fe1ce78daf10a22be17a9a24143c43f3</originalsourceid><addsrcrecordid>eNo9kE1LAzEQQIMoWKs_QLwEPG_NZ3dz1F2rQkGhFbyF7O6k3domNcke-u_dUvE0c3hvBh5Ct5RMKCXqYflRLSaMMDJhqhBTzs7QiEpZZIwW_HzYiZCZYlRdoqsYN4RQIYkYoa_lGnAFsVs57C2emZhw6V0Cl7IKbOegxeW6d9-dW2HrA65MMoPQ9vtt15jUeYefTByoRfLBrAAvDjHBLl6jC2u2EW7-5hh9zp6X5Ws2f395Kx_nWcMUTxmdEpHXtWwUmJYVhMsGiLStsHkuaS25BdpAXrTGUmIYq4HmRhkmqOCN4JaP0f3p7j74nx5i0hvfBze81IwrymTOp3yg6Ilqgo8xgNX70O1MOGhK9DGgPgbUx4D6L-Dg3J2cDgD-eUWkJAXjvzs_a8s</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2391257363</pqid></control><display><type>article</type><title>The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems</title><source>IEEE Electronic Library (IEL)</source><creator>Xia, Wen ; Zou, Xiangyu ; Jiang, Hong ; Zhou, Yukun ; Liu, Chuanyi ; Feng, Dan ; Hua, Yu ; Hu, Yuchong ; Zhang, Yucheng</creator><creatorcontrib>Xia, Wen ; Zou, Xiangyu ; Jiang, Hong ; Zhou, Yukun ; Liu, Chuanyi ; Feng, Dan ; Hua, Yu ; Hu, Yuchong ; Zhang, Yucheng</creatorcontrib><description>Content-Defined Chunking (CDC) has been playing a key role in data deduplication systems recently due to its high redundancy detection ability. However, existing CDC-based approaches introduce heavy CPU overhead because they declare the chunk cut-points by computing and judging the rolling hashes of the data stream byte by byte. In this article, we propose FastCDC, a Fast and efficient Content-Defined Chunking approach, for data deduplication-based storage systems. 
The key idea behind FastCDC is the combined use of five key techniques, namely, gear based fast rolling hash, simplifying and enhancing the Gear hash judgment, skipping sub-minimum chunk cut-points, normalizing the chunk-size distribution in a small specified region to address the problem of the decreased deduplication ratio stemming from the cut-point skipping, and last but not least, rolling two bytes each time to further speed up CDC. Our evaluation results show that, by using a combination of the five techniques, FastCDC is 3-12X faster than the state-of-the-art CDC approaches, while achieving nearly the same and even higher deduplication ratio as the classic Rabin-based CDC. In addition, our study on the deduplication throughput of FastCDC-based Destor (an open source deduplication project) indicates that FastCDC helps achieve 1.2-3.0X higher throughput than Destor based on state-of-the-art chunkers.</description><identifier>ISSN: 1045-9219</identifier><identifier>EISSN: 1558-2183</identifier><identifier>DOI: 10.1109/TPDS.2020.2984632</identifier><identifier>CODEN: ITDSEO</identifier><language>eng</language><publisher>New York: IEEE</publisher><subject>Acceleration ; content-defined chunking ; Data deduplication ; Data transmission ; Distributed databases ; Gears ; Microsoft Windows ; Normalizing ; performance evaluation ; Power capacitors ; Redundancy ; Size distribution ; storage system ; Storage systems ; Throughput</subject><ispartof>IEEE transactions on parallel and distributed systems, 2020-09, Vol.31 (9), p.2017-2031</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) 2020</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c293t-16047bb5c9ead28035ce05fd4f7751b53fe1ce78daf10a22be17a9a24143c43f3</citedby><cites>FETCH-LOGICAL-c293t-16047bb5c9ead28035ce05fd4f7751b53fe1ce78daf10a22be17a9a24143c43f3</cites><orcidid>0000-0003-1265-7141 ; 0000-0001-5104-8301 ; 0000-0003-4093-6391 ; 0000-0001-7716-1214 ; 0000-0002-1477-9751 ; 0000-0003-0774-462X</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/9055082$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,780,784,796,27923,27924,54757</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/9055082$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Xia, Wen</creatorcontrib><creatorcontrib>Zou, Xiangyu</creatorcontrib><creatorcontrib>Jiang, Hong</creatorcontrib><creatorcontrib>Zhou, Yukun</creatorcontrib><creatorcontrib>Liu, Chuanyi</creatorcontrib><creatorcontrib>Feng, Dan</creatorcontrib><creatorcontrib>Hua, Yu</creatorcontrib><creatorcontrib>Hu, Yuchong</creatorcontrib><creatorcontrib>Zhang, Yucheng</creatorcontrib><title>The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems</title><title>IEEE transactions on parallel and distributed systems</title><addtitle>TPDS</addtitle><description>Content-Defined Chunking (CDC) has been playing a key role in data deduplication systems recently due to its high redundancy detection ability. However, existing CDC-based approaches introduce heavy CPU overhead because they declare the chunk cut-points by computing and judging the rolling hashes of the data stream byte by byte. In this article, we propose FastCDC, a Fast and efficient Content-Defined Chunking approach, for data deduplication-based storage systems. 
The key idea behind FastCDC is the combined use of five key techniques, namely, gear based fast rolling hash, simplifying and enhancing the Gear hash judgment, skipping sub-minimum chunk cut-points, normalizing the chunk-size distribution in a small specified region to address the problem of the decreased deduplication ratio stemming from the cut-point skipping, and last but not least, rolling two bytes each time to further speed up CDC. Our evaluation results show that, by using a combination of the five techniques, FastCDC is 3-12X faster than the state-of-the-art CDC approaches, while achieving nearly the same and even higher deduplication ratio as the classic Rabin-based CDC. In addition, our study on the deduplication throughput of FastCDC-based Destor (an open source deduplication project) indicates that FastCDC helps achieve 1.2-3.0X higher throughput than Destor based on state-of-the-art chunkers.</description><subject>Acceleration</subject><subject>content-defined chunking</subject><subject>Data deduplication</subject><subject>Data transmission</subject><subject>Distributed databases</subject><subject>Gears</subject><subject>Microsoft Windows</subject><subject>Normalizing</subject><subject>performance evaluation</subject><subject>Power capacitors</subject><subject>Redundancy</subject><subject>Size distribution</subject><subject>storage system</subject><subject>Storage 
systems</subject><subject>Throughput</subject><issn>1045-9219</issn><issn>1558-2183</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNo9kE1LAzEQQIMoWKs_QLwEPG_NZ3dz1F2rQkGhFbyF7O6k3domNcke-u_dUvE0c3hvBh5Ct5RMKCXqYflRLSaMMDJhqhBTzs7QiEpZZIwW_HzYiZCZYlRdoqsYN4RQIYkYoa_lGnAFsVs57C2emZhw6V0Cl7IKbOegxeW6d9-dW2HrA65MMoPQ9vtt15jUeYefTByoRfLBrAAvDjHBLl6jC2u2EW7-5hh9zp6X5Ws2f395Kx_nWcMUTxmdEpHXtWwUmJYVhMsGiLStsHkuaS25BdpAXrTGUmIYq4HmRhkmqOCN4JaP0f3p7j74nx5i0hvfBze81IwrymTOp3yg6Ilqgo8xgNX70O1MOGhK9DGgPgbUx4D6L-Dg3J2cDgD-eUWkJAXjvzs_a8s</recordid><startdate>20200901</startdate><enddate>20200901</enddate><creator>Xia, Wen</creator><creator>Zou, Xiangyu</creator><creator>Jiang, Hong</creator><creator>Zhou, Yukun</creator><creator>Liu, Chuanyi</creator><creator>Feng, Dan</creator><creator>Hua, Yu</creator><creator>Hu, Yuchong</creator><creator>Zhang, Yucheng</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><orcidid>https://orcid.org/0000-0003-1265-7141</orcidid><orcidid>https://orcid.org/0000-0001-5104-8301</orcidid><orcidid>https://orcid.org/0000-0003-4093-6391</orcidid><orcidid>https://orcid.org/0000-0001-7716-1214</orcidid><orcidid>https://orcid.org/0000-0002-1477-9751</orcidid><orcidid>https://orcid.org/0000-0003-0774-462X</orcidid></search><sort><creationdate>20200901</creationdate><title>The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems</title><author>Xia, Wen ; Zou, Xiangyu ; Jiang, Hong ; Zhou, Yukun ; Liu, Chuanyi ; Feng, Dan ; Hua, Yu ; Hu, Yuchong ; Zhang, Yucheng</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c293t-16047bb5c9ead28035ce05fd4f7751b53fe1ce78daf10a22be17a9a24143c43f3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Acceleration</topic><topic>content-defined chunking</topic><topic>Data deduplication</topic><topic>Data transmission</topic><topic>Distributed databases</topic><topic>Gears</topic><topic>Microsoft Windows</topic><topic>Normalizing</topic><topic>performance evaluation</topic><topic>Power capacitors</topic><topic>Redundancy</topic><topic>Size distribution</topic><topic>storage system</topic><topic>Storage systems</topic><topic>Throughput</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Xia, Wen</creatorcontrib><creatorcontrib>Zou, Xiangyu</creatorcontrib><creatorcontrib>Jiang, Hong</creatorcontrib><creatorcontrib>Zhou, Yukun</creatorcontrib><creatorcontrib>Liu, Chuanyi</creatorcontrib><creatorcontrib>Feng, Dan</creatorcontrib><creatorcontrib>Hua, Yu</creatorcontrib><creatorcontrib>Hu, 
Yuchong</creatorcontrib><creatorcontrib>Zhang, Yucheng</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics &amp; Communications Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>IEEE transactions on parallel and distributed systems</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Xia, Wen</au><au>Zou, Xiangyu</au><au>Jiang, Hong</au><au>Zhou, Yukun</au><au>Liu, Chuanyi</au><au>Feng, Dan</au><au>Hua, Yu</au><au>Hu, Yuchong</au><au>Zhang, Yucheng</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems</atitle><jtitle>IEEE transactions on parallel and distributed systems</jtitle><stitle>TPDS</stitle><date>2020-09-01</date><risdate>2020</risdate><volume>31</volume><issue>9</issue><spage>2017</spage><epage>2031</epage><pages>2017-2031</pages><issn>1045-9219</issn><eissn>1558-2183</eissn><coden>ITDSEO</coden><abstract>Content-Defined Chunking (CDC) has been playing a key role in data deduplication systems recently due to its high redundancy detection ability. However, existing CDC-based approaches introduce heavy CPU overhead because they declare the chunk cut-points by computing and judging the rolling hashes of the data stream byte by byte. 
In this article, we propose FastCDC, a Fast and efficient Content-Defined Chunking approach, for data deduplication-based storage systems. The key idea behind FastCDC is the combined use of five key techniques, namely, gear based fast rolling hash, simplifying and enhancing the Gear hash judgment, skipping sub-minimum chunk cut-points, normalizing the chunk-size distribution in a small specified region to address the problem of the decreased deduplication ratio stemming from the cut-point skipping, and last but not least, rolling two bytes each time to further speed up CDC. Our evaluation results show that, by using a combination of the five techniques, FastCDC is 3-12X faster than the state-of-the-art CDC approaches, while achieving nearly the same and even higher deduplication ratio as the classic Rabin-based CDC. In addition, our study on the deduplication throughput of FastCDC-based Destor (an open source deduplication project) indicates that FastCDC helps achieve 1.2-3.0X higher throughput than Destor based on state-of-the-art chunkers.</abstract><cop>New York</cop><pub>IEEE</pub><doi>10.1109/TPDS.2020.2984632</doi><tpages>15</tpages><orcidid>https://orcid.org/0000-0003-1265-7141</orcidid><orcidid>https://orcid.org/0000-0001-5104-8301</orcidid><orcidid>https://orcid.org/0000-0003-4093-6391</orcidid><orcidid>https://orcid.org/0000-0001-7716-1214</orcidid><orcidid>https://orcid.org/0000-0002-1477-9751</orcidid><orcidid>https://orcid.org/0000-0003-0774-462X</orcidid></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 1045-9219
ispartof IEEE transactions on parallel and distributed systems, 2020-09, Vol.31 (9), p.2017-2031
issn 1045-9219
1558-2183
language eng
recordid cdi_proquest_journals_2391257363
source IEEE Electronic Library (IEL)
subjects Acceleration
content-defined chunking
Data deduplication
Data transmission
Distributed databases
Gears
Microsoft Windows
Normalizing
performance evaluation
Power capacitors
Redundancy
Size distribution
storage system
Storage systems
Throughput
title The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-12T00%3A38%3A46IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=The%20Design%20of%20Fast%20Content-Defined%20Chunking%20for%20Data%20Deduplication%20Based%20Storage%20Systems&rft.jtitle=IEEE%20transactions%20on%20parallel%20and%20distributed%20systems&rft.au=Xia,%20Wen&rft.date=2020-09-01&rft.volume=31&rft.issue=9&rft.spage=2017&rft.epage=2031&rft.pages=2017-2031&rft.issn=1045-9219&rft.eissn=1558-2183&rft.coden=ITDSEO&rft_id=info:doi/10.1109/TPDS.2020.2984632&rft_dat=%3Cproquest_RIE%3E2391257363%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2391257363&rft_id=info:pmid/&rft_ieee_id=9055082&rfr_iscdi=true