The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems
Content-Defined Chunking (CDC) has been playing a key role in data deduplication systems recently due to its high redundancy detection ability. However, existing CDC-based approaches introduce heavy CPU overhead because they declare the chunk cut-points by computing and judging the rolling hashes of...
Published in: | IEEE transactions on parallel and distributed systems 2020-09, Vol.31 (9), p.2017-2031 |
---|---|
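Main authors: | Xia, Wen ; Zou, Xiangyu ; Jiang, Hong ; Zhou, Yukun ; Liu, Chuanyi ; Feng, Dan ; Hua, Yu ; Hu, Yuchong ; Zhang, Yucheng |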
Format: | Article |
Language: | eng |
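Subjects: | Acceleration ; content-defined chunking ; Data deduplication ; Data transmission ; Distributed databases ; Gears ; Microsoft Windows ; Normalizing ; performance evaluation ; Power capacitors ; Redundancy ; Size distribution ; storage system ; Storage systems ; Throughput |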
Online access: | Order full text |
container_end_page | 2031 |
---|---|
container_issue | 9 |
container_start_page | 2017 |
container_title | IEEE transactions on parallel and distributed systems |
container_volume | 31 |
creator | Xia, Wen ; Zou, Xiangyu ; Jiang, Hong ; Zhou, Yukun ; Liu, Chuanyi ; Feng, Dan ; Hua, Yu ; Hu, Yuchong ; Zhang, Yucheng |
description | Content-Defined Chunking (CDC) has been playing a key role in data deduplication systems recently due to its high redundancy detection ability. However, existing CDC-based approaches introduce heavy CPU overhead because they declare the chunk cut-points by computing and judging the rolling hashes of the data stream byte by byte. In this article, we propose FastCDC, a Fast and efficient Content-Defined Chunking approach, for data deduplication-based storage systems. The key idea behind FastCDC is the combined use of five key techniques, namely, gear based fast rolling hash, simplifying and enhancing the Gear hash judgment, skipping sub-minimum chunk cut-points, normalizing the chunk-size distribution in a small specified region to address the problem of the decreased deduplication ratio stemming from the cut-point skipping, and last but not least, rolling two bytes each time to further speed up CDC. Our evaluation results show that, by using a combination of the five techniques, FastCDC is 3-12X faster than the state-of-the-art CDC approaches, while achieving nearly the same and even higher deduplication ratio as the classic Rabin-based CDC. In addition, our study on the deduplication throughput of FastCDC-based Destor (an open source deduplication project) indicates that FastCDC helps achieve 1.2-3.0X higher throughput than Destor based on state-of-the-art chunkers. |
doi_str_mv | 10.1109/TPDS.2020.2984632 |
format | Article |
publisher | New York: IEEE |
coden | ITDSEO |
ieee_id | 9055082 |
linktohtml | https://ieeexplore.ieee.org/document/9055082 |
creationdate | 2020-09-01 |
tpages | 15 |
rights | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2020 |
toplevel | peer_reviewed ; online_resources |
orcid | 0000-0003-1265-7141 ; 0000-0001-5104-8301 ; 0000-0003-4093-6391 ; 0000-0001-7716-1214 ; 0000-0002-1477-9751 ; 0000-0003-0774-462X |
collection | IEEE All-Society Periodicals Package (ASPP) 2005-present ; IEEE All-Society Periodicals Package (ASPP) 1998-Present ; IEEE Electronic Library (IEL) ; CrossRef ; Computer and Information Systems Abstracts ; Electronics & Communications Abstracts ; Technology Research Database ; ProQuest Computer Science Collection ; Advanced Technologies Database with Aerospace ; Computer and Information Systems Abstracts Academic ; Computer and Information Systems Abstracts Professional |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 1045-9219 |
ispartof | IEEE transactions on parallel and distributed systems, 2020-09, Vol.31 (9), p.2017-2031 |
issn | 1045-9219 ; 1558-2183 |
language | eng |
recordid | cdi_proquest_journals_2391257363 |
source | IEEE Electronic Library (IEL) |
subjects | Acceleration ; content-defined chunking ; Data deduplication ; Data transmission ; Distributed databases ; Gears ; Microsoft Windows ; Normalizing ; performance evaluation ; Power capacitors ; Redundancy ; Size distribution ; storage system ; Storage systems ; Throughput |
title | The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-12T00%3A38%3A46IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=The%20Design%20of%20Fast%20Content-Defined%20Chunking%20for%20Data%20Deduplication%20Based%20Storage%20Systems&rft.jtitle=IEEE%20transactions%20on%20parallel%20and%20distributed%20systems&rft.au=Xia,%20Wen&rft.date=2020-09-01&rft.volume=31&rft.issue=9&rft.spage=2017&rft.epage=2031&rft.pages=2017-2031&rft.issn=1045-9219&rft.eissn=1558-2183&rft.coden=ITDSEO&rft_id=info:doi/10.1109/TPDS.2020.2984632&rft_dat=%3Cproquest_RIE%3E2391257363%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2391257363&rft_id=info:pmid/&rft_ieee_id=9055082&rfr_iscdi=true |
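
The description field names several techniques without showing how they fit together. The following is a minimal Python sketch of three of them (a Gear-style rolling hash, skipping sub-minimum cut-points, and a two-mask normalization around the expected chunk size), written only to illustrate the idea. It is not the authors' implementation: the size constants, the mask layout, the seeded pseudo-random gear table, and the names `next_cut` and `chunk` are all assumptions made here, and the "rolling two bytes each time" optimization is not included.

```python
import random

# Hypothetical parameters for illustration; the record does not state them.
MIN_SIZE = 2 * 1024    # no cut-point is tested before this many bytes
AVG_SIZE = 8 * 1024    # point at which the judgment switches masks
MAX_SIZE = 64 * 1024   # a cut is forced at this length

HASH_MASK = (1 << 64) - 1                # keep the fingerprint at 64 bits
# Assumed mask layout: test the top bits of the fingerprint.
MASK_STRICT = ((1 << 15) - 1) << 49      # 15 bits: fewer matches before AVG_SIZE
MASK_LOOSE = ((1 << 11) - 1) << 53       # 11 bits: more matches after AVG_SIZE

# Gear table: one fixed pseudo-random 64-bit value per byte value.
# The seed is arbitrary; any fixed table gives deterministic chunking.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def next_cut(data: bytes, start: int) -> int:
    """Return the exclusive end offset of the chunk that begins at start."""
    if len(data) - start <= MIN_SIZE:
        return len(data)                 # the tail is too short to split
    end = min(start + MAX_SIZE, len(data))
    normal = min(start + AVG_SIZE, end)
    fp = 0
    i = start + MIN_SIZE                 # skip sub-minimum cut-points
    while i < end:
        # Gear rolling hash: one shift, one add, one table lookup per byte.
        fp = ((fp << 1) + GEAR[data[i]]) & HASH_MASK
        # Normalization: strict mask before AVG_SIZE, loose mask after it.
        mask = MASK_STRICT if i < normal else MASK_LOOSE
        if fp & mask == 0:
            return i + 1
        i += 1
    return end                           # forced cut at MAX_SIZE or end of data


def chunk(data: bytes) -> list:
    """Split data into content-defined chunks."""
    chunks = []
    start = 0
    while start < len(data):
        cut = next_cut(data, start)
        chunks.append(data[start:cut])
        start = cut
    return chunks
```

Calling `chunk(payload)` on a `bytes` object returns a list of chunks whose concatenation equals the input. Every chunk except possibly the last is longer than `MIN_SIZE`, and none exceeds `MAX_SIZE`. Using two masks of different strictness is one way to pull the chunk-size distribution toward `AVG_SIZE`, which is the role the description assigns to normalization after cut-point skipping.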