The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems

Content-Defined Chunking (CDC) has been playing a key role in data deduplication systems recently due to its high redundancy detection ability. However, existing CDC-based approaches introduce heavy CPU overhead because they declare the chunk cut-points by computing and judging the rolling hashes of...

Detailed description

Bibliographic details
Published in:	IEEE transactions on parallel and distributed systems, 2020-09, Vol.31 (9), p.2017-2031
Main authors:	Xia, Wen; Zou, Xiangyu; Jiang, Hong; Zhou, Yukun; Liu, Chuanyi; Feng, Dan; Hua, Yu; Hu, Yuchong; Zhang, Yucheng
Format:	Article
Language:	eng
Subjects:	Acceleration; content-defined chunking; Data deduplication; Data transmission; Distributed databases; Gears; Microsoft Windows; Normalizing; performance evaluation; Power capacitors; Redundancy; Size distribution; storage system; Storage systems; Throughput
Online access:	Order full text

container_end_page	2031
container_issue	9
container_start_page	2017
container_title	IEEE transactions on parallel and distributed systems
container_volume	31
creator	Xia, Wen; Zou, Xiangyu; Jiang, Hong; Zhou, Yukun; Liu, Chuanyi; Feng, Dan; Hua, Yu; Hu, Yuchong; Zhang, Yucheng
description	Content-Defined Chunking (CDC) has been playing a key role in data deduplication systems recently due to its high redundancy detection ability. However, existing CDC-based approaches introduce heavy CPU overhead because they declare the chunk cut-points by computing and judging the rolling hashes of the data stream byte by byte. In this article, we propose FastCDC, a Fast and efficient Content-Defined Chunking approach, for data deduplication-based storage systems. The key idea behind FastCDC is the combined use of five key techniques, namely, gear based fast rolling hash, simplifying and enhancing the Gear hash judgment, skipping sub-minimum chunk cut-points, normalizing the chunk-size distribution in a small specified region to address the problem of the decreased deduplication ratio stemming from the cut-point skipping, and last but not least, rolling two bytes each time to further speed up CDC. Our evaluation results show that, by using a combination of the five techniques, FastCDC is 3-12X faster than the state-of-the-art CDC approaches, while achieving nearly the same and even higher deduplication ratio as the classic Rabin-based CDC. In addition, our study on the deduplication throughput of FastCDC-based Destor (an open source deduplication project) indicates that FastCDC helps achieve 1.2-3.0X higher throughput than Destor based on state-of-the-art chunkers.
doi_str_mv	10.1109/TPDS.2020.2984632
format	Article
fullrecord	<record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_proquest_journals_2391257363</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>9055082</ieee_id><sourcerecordid>2391257363</sourcerecordid><originalsourceid>FETCH-LOGICAL-c293t-16047bb5c9ead28035ce05fd4f7751b53fe1ce78daf10a22be17a9a24143c43f3</originalsourceid><addsrcrecordid>eNo9kE1LAzEQQIMoWKs_QLwEPG_NZ3dz1F2rQkGhFbyF7O6k3domNcke-u_dUvE0c3hvBh5Ct5RMKCXqYflRLSaMMDJhqhBTzs7QiEpZZIwW_HzYiZCZYlRdoqsYN4RQIYkYoa_lGnAFsVs57C2emZhw6V0Cl7IKbOegxeW6d9-dW2HrA65MMoPQ9vtt15jUeYefTByoRfLBrAAvDjHBLl6jC2u2EW7-5hh9zp6X5Ws2f395Kx_nWcMUTxmdEpHXtWwUmJYVhMsGiLStsHkuaS25BdpAXrTGUmIYq4HmRhkmqOCN4JaP0f3p7j74nx5i0hvfBze81IwrymTOp3yg6Ilqgo8xgNX70O1MOGhK9DGgPgbUx4D6L-Dg3J2cDgD-eUWkJAXjvzs_a8s</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2391257363</pqid></control><display><type>article</type><title>The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems</title><source>IEEE Electronic Library (IEL)</source><creator>Xia, Wen ; Zou, Xiangyu ; Jiang, Hong ; Zhou, Yukun ; Liu, Chuanyi ; Feng, Dan ; Hua, Yu ; Hu, Yuchong ; Zhang, Yucheng</creator><creatorcontrib>Xia, Wen ; Zou, Xiangyu ; Jiang, Hong ; Zhou, Yukun ; Liu, Chuanyi ; Feng, Dan ; Hua, Yu ; Hu, Yuchong ; Zhang, Yucheng</creatorcontrib><description>Content-Defined Chunking (CDC) has been playing a key role in data deduplication systems recently due to its high redundancy detection ability. However, existing CDC-based approaches introduce heavy CPU overhead because they declare the chunk cut-points by computing and judging the rolling hashes of the data stream byte by byte. In this article, we propose FastCDC, a Fast and efficient Content-Defined Chunking approach, for data deduplication-based storage systems. 
The key idea behind FastCDC is the combined use of five key techniques, namely, gear based fast rolling hash, simplifying and enhancing the Gear hash judgment, skipping sub-minimum chunk cut-points, normalizing the chunk-size distribution in a small specified region to address the problem of the decreased deduplication ratio stemming from the cut-point skipping, and last but not least, rolling two bytes each time to further speed up CDC. Our evaluation results show that, by using a combination of the five techniques, FastCDC is 3-12X faster than the state-of-the-art CDC approaches, while achieving nearly the same and even higher deduplication ratio as the classic Rabin-based CDC. In addition, our study on the deduplication throughput of FastCDC-based Destor (an open source deduplication project) indicates that FastCDC helps achieve 1.2-3.0X higher throughput than Destor based on state-of-the-art chunkers.</description><identifier>ISSN: 1045-9219</identifier><identifier>EISSN: 1558-2183</identifier><identifier>DOI: 10.1109/TPDS.2020.2984632</identifier><identifier>CODEN: ITDSEO</identifier><language>eng</language><publisher>New York: IEEE</publisher><subject>Acceleration ; content-defined chunking ; Data deduplication ; Data transmission ; Distributed databases ; Gears ; Microsoft Windows ; Normalizing ; performance evaluation ; Power capacitors ; Redundancy ; Size distribution ; storage system ; Storage systems ; Throughput</subject><ispartof>IEEE transactions on parallel and distributed systems, 2020-09, Vol.31 (9), p.2017-2031</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) 2020</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c293t-16047bb5c9ead28035ce05fd4f7751b53fe1ce78daf10a22be17a9a24143c43f3</citedby><cites>FETCH-LOGICAL-c293t-16047bb5c9ead28035ce05fd4f7751b53fe1ce78daf10a22be17a9a24143c43f3</cites><orcidid>0000-0003-1265-7141 ; 0000-0001-5104-8301 ; 0000-0003-4093-6391 ; 0000-0001-7716-1214 ; 0000-0002-1477-9751 ; 0000-0003-0774-462X</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/9055082$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,780,784,796,27923,27924,54757</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/9055082$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Xia, Wen</creatorcontrib><creatorcontrib>Zou, Xiangyu</creatorcontrib><creatorcontrib>Jiang, Hong</creatorcontrib><creatorcontrib>Zhou, Yukun</creatorcontrib><creatorcontrib>Liu, Chuanyi</creatorcontrib><creatorcontrib>Feng, Dan</creatorcontrib><creatorcontrib>Hua, Yu</creatorcontrib><creatorcontrib>Hu, Yuchong</creatorcontrib><creatorcontrib>Zhang, Yucheng</creatorcontrib><title>The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems</title><title>IEEE transactions on parallel and distributed systems</title><addtitle>TPDS</addtitle><description>Content-Defined Chunking (CDC) has been playing a key role in data deduplication systems recently due to its high redundancy detection ability. However, existing CDC-based approaches introduce heavy CPU overhead because they declare the chunk cut-points by computing and judging the rolling hashes of the data stream byte by byte. In this article, we propose FastCDC, a Fast and efficient Content-Defined Chunking approach, for data deduplication-based storage systems. 
The key idea behind FastCDC is the combined use of five key techniques, namely, gear based fast rolling hash, simplifying and enhancing the Gear hash judgment, skipping sub-minimum chunk cut-points, normalizing the chunk-size distribution in a small specified region to address the problem of the decreased deduplication ratio stemming from the cut-point skipping, and last but not least, rolling two bytes each time to further speed up CDC. Our evaluation results show that, by using a combination of the five techniques, FastCDC is 3-12X faster than the state-of-the-art CDC approaches, while achieving nearly the same and even higher deduplication ratio as the classic Rabin-based CDC. In addition, our study on the deduplication throughput of FastCDC-based Destor (an open source deduplication project) indicates that FastCDC helps achieve 1.2-3.0X higher throughput than Destor based on state-of-the-art chunkers.</description><subject>Acceleration</subject><subject>content-defined chunking</subject><subject>Data deduplication</subject><subject>Data transmission</subject><subject>Distributed databases</subject><subject>Gears</subject><subject>Microsoft Windows</subject><subject>Normalizing</subject><subject>performance evaluation</subject><subject>Power capacitors</subject><subject>Redundancy</subject><subject>Size distribution</subject><subject>storage system</subject><subject>Storage 
systems</subject><subject>Throughput</subject><issn>1045-9219</issn><issn>1558-2183</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNo9kE1LAzEQQIMoWKs_QLwEPG_NZ3dz1F2rQkGhFbyF7O6k3domNcke-u_dUvE0c3hvBh5Ct5RMKCXqYflRLSaMMDJhqhBTzs7QiEpZZIwW_HzYiZCZYlRdoqsYN4RQIYkYoa_lGnAFsVs57C2emZhw6V0Cl7IKbOegxeW6d9-dW2HrA65MMoPQ9vtt15jUeYefTByoRfLBrAAvDjHBLl6jC2u2EW7-5hh9zp6X5Ws2f395Kx_nWcMUTxmdEpHXtWwUmJYVhMsGiLStsHkuaS25BdpAXrTGUmIYq4HmRhkmqOCN4JaP0f3p7j74nx5i0hvfBze81IwrymTOp3yg6Ilqgo8xgNX70O1MOGhK9DGgPgbUx4D6L-Dg3J2cDgD-eUWkJAXjvzs_a8s</recordid><startdate>20200901</startdate><enddate>20200901</enddate><creator>Xia, Wen</creator><creator>Zou, Xiangyu</creator><creator>Jiang, Hong</creator><creator>Zhou, Yukun</creator><creator>Liu, Chuanyi</creator><creator>Feng, Dan</creator><creator>Hua, Yu</creator><creator>Hu, Yuchong</creator><creator>Zhang, Yucheng</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><orcidid>https://orcid.org/0000-0003-1265-7141</orcidid><orcidid>https://orcid.org/0000-0001-5104-8301</orcidid><orcidid>https://orcid.org/0000-0003-4093-6391</orcidid><orcidid>https://orcid.org/0000-0001-7716-1214</orcidid><orcidid>https://orcid.org/0000-0002-1477-9751</orcidid><orcidid>https://orcid.org/0000-0003-0774-462X</orcidid></search><sort><creationdate>20200901</creationdate><title>The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems</title><author>Xia, Wen ; Zou, Xiangyu ; Jiang, Hong ; Zhou, Yukun ; Liu, Chuanyi ; Feng, Dan ; Hua, Yu ; Hu, Yuchong ; Zhang, Yucheng</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c293t-16047bb5c9ead28035ce05fd4f7751b53fe1ce78daf10a22be17a9a24143c43f3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Acceleration</topic><topic>content-defined chunking</topic><topic>Data deduplication</topic><topic>Data transmission</topic><topic>Distributed databases</topic><topic>Gears</topic><topic>Microsoft Windows</topic><topic>Normalizing</topic><topic>performance evaluation</topic><topic>Power capacitors</topic><topic>Redundancy</topic><topic>Size distribution</topic><topic>storage system</topic><topic>Storage systems</topic><topic>Throughput</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Xia, Wen</creatorcontrib><creatorcontrib>Zou, Xiangyu</creatorcontrib><creatorcontrib>Jiang, Hong</creatorcontrib><creatorcontrib>Zhou, Yukun</creatorcontrib><creatorcontrib>Liu, Chuanyi</creatorcontrib><creatorcontrib>Feng, Dan</creatorcontrib><creatorcontrib>Hua, Yu</creatorcontrib><creatorcontrib>Hu, 
Yuchong</creatorcontrib><creatorcontrib>Zhang, Yucheng</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics & Communications Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>IEEE transactions on parallel and distributed systems</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Xia, Wen</au><au>Zou, Xiangyu</au><au>Jiang, Hong</au><au>Zhou, Yukun</au><au>Liu, Chuanyi</au><au>Feng, Dan</au><au>Hua, Yu</au><au>Hu, Yuchong</au><au>Zhang, Yucheng</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems</atitle><jtitle>IEEE transactions on parallel and distributed systems</jtitle><stitle>TPDS</stitle><date>2020-09-01</date><risdate>2020</risdate><volume>31</volume><issue>9</issue><spage>2017</spage><epage>2031</epage><pages>2017-2031</pages><issn>1045-9219</issn><eissn>1558-2183</eissn><coden>ITDSEO</coden><abstract>Content-Defined Chunking (CDC) has been playing a key role in data deduplication systems recently due to its high redundancy detection ability. However, existing CDC-based approaches introduce heavy CPU overhead because they declare the chunk cut-points by computing and judging the rolling hashes of the data stream byte by byte. 
In this article, we propose FastCDC, a Fast and efficient Content-Defined Chunking approach, for data deduplication-based storage systems. The key idea behind FastCDC is the combined use of five key techniques, namely, gear based fast rolling hash, simplifying and enhancing the Gear hash judgment, skipping sub-minimum chunk cut-points, normalizing the chunk-size distribution in a small specified region to address the problem of the decreased deduplication ratio stemming from the cut-point skipping, and last but not least, rolling two bytes each time to further speed up CDC. Our evaluation results show that, by using a combination of the five techniques, FastCDC is 3-12X faster than the state-of-the-art CDC approaches, while achieving nearly the same and even higher deduplication ratio as the classic Rabin-based CDC. In addition, our study on the deduplication throughput of FastCDC-based Destor (an open source deduplication project) indicates that FastCDC helps achieve 1.2-3.0X higher throughput than Destor based on state-of-the-art chunkers.</abstract><cop>New York</cop><pub>IEEE</pub><doi>10.1109/TPDS.2020.2984632</doi><tpages>15</tpages><orcidid>https://orcid.org/0000-0003-1265-7141</orcidid><orcidid>https://orcid.org/0000-0001-5104-8301</orcidid><orcidid>https://orcid.org/0000-0003-4093-6391</orcidid><orcidid>https://orcid.org/0000-0001-7716-1214</orcidid><orcidid>https://orcid.org/0000-0002-1477-9751</orcidid><orcidid>https://orcid.org/0000-0003-0774-462X</orcidid></addata></record>
fulltext	fulltext_linktorsrc
identifier	ISSN: 1045-9219
ispartof	IEEE transactions on parallel and distributed systems, 2020-09, Vol.31 (9), p.2017-2031
issn	1045-9219 1558-2183
language	eng
recordid	cdi_proquest_journals_2391257363
source	IEEE Electronic Library (IEL)
subjects	Acceleration; content-defined chunking; Data deduplication; Data transmission; Distributed databases; Gears; Microsoft Windows; Normalizing; performance evaluation; Power capacitors; Redundancy; Size distribution; storage system; Storage systems; Throughput
title	The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-12T00%3A38%3A46IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=The%20Design%20of%20Fast%20Content-Defined%20Chunking%20for%20Data%20Deduplication%20Based%20Storage%20Systems&rft.jtitle=IEEE%20transactions%20on%20parallel%20and%20distributed%20systems&rft.au=Xia,%20Wen&rft.date=2020-09-01&rft.volume=31&rft.issue=9&rft.spage=2017&rft.epage=2031&rft.pages=2017-2031&rft.issn=1045-9219&rft.eissn=1558-2183&rft.coden=ITDSEO&rft_id=info:doi/10.1109/TPDS.2020.2984632&rft_dat=%3Cproquest_RIE%3E2391257363%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2391257363&rft_id=info:pmid/&rft_ieee_id=9055082&rfr_iscdi=true
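
Illustrative sketch (not part of the catalog record): the description field names a gear-based rolling hash, skipping of sub-minimum cut-points, and normalization of the chunk-size distribution. The Python below is a minimal sketch of how those three ideas can fit together, not the authors' implementation. Everything concrete in it is an assumption made for illustration: the size limits, the two masks (plain high-bit masks chosen here for simplicity), the seeded random gear table, and the names `next_cut` and `chunks`. The fifth technique, rolling two bytes at a time, is left out.

```python
import random

# Hypothetical parameters, chosen for illustration only.
MIN_SIZE = 2 * 1024    # no cut-point is tested before this many bytes
AVG_SIZE = 8 * 1024    # point at which the easier mask takes over
MAX_SIZE = 64 * 1024   # a cut is forced at this length
U64 = (1 << 64) - 1

# Assumed masks: more bits = cut less likely (used before AVG_SIZE),
# fewer bits = cut more likely (used after AVG_SIZE).
MASK_HARD = ((1 << 15) - 1) << 49
MASK_EASY = ((1 << 11) - 1) << 53

# Gear table: one fixed pseudo-random 64-bit value per byte value.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def next_cut(data: bytes, start: int) -> int:
    """Return the exclusive end offset of the chunk that begins at start."""
    if len(data) - start <= MIN_SIZE:
        return len(data)
    end = min(start + MAX_SIZE, len(data))
    normal = min(start + AVG_SIZE, end)
    fp = 0
    # Skip the first MIN_SIZE bytes: no hashing, no judgment there.
    for i in range(start + MIN_SIZE, end):
        # Gear rolling hash: one shift, one add, one table lookup per byte.
        fp = ((fp << 1) + GEAR[data[i]]) & U64
        mask = MASK_HARD if i < normal else MASK_EASY
        if fp & mask == 0:
            return i + 1
    return end  # forced cut at MAX_SIZE or end of input


def chunks(data: bytes):
    """Yield consecutive content-defined chunks of data."""
    pos = 0
    while pos < len(data):
        cut = next_cut(data, pos)
        yield data[pos:cut]
        pos = cut
```

Calling `list(chunks(payload))` on a bytes object returns chunks that concatenate back to the input. Switching from the harder to the easier mask at `AVG_SIZE` is what pulls chunk sizes toward the middle of the allowed range in this sketch.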