The Design of Fast and Lightweight Resemblance Detection for Efficient Post-Deduplication Delta Compression

Post-deduplication delta compression is a data reduction technique that calculates and stores the differences of the very similar but non-duplicate chunks in storage systems, which is able to achieve a very high compression ratio. However, the low throughput of widely-used resemblance detection appr...

Detailed description

Bibliographic details
Published in:	ACM transactions on storage 2023-08, Vol.19 (3), p.1-30
Main authors:	Xia, Wen ; Pu, Lifeng ; Zou, Xiangyu ; Shilane, Philip ; Li, Shiyi ; Zhang, Haijun ; Wang, Xuan
Format:	Article
Language:	eng
Subjects:	Deduplication ; Extraction, transformation and loading ; Information systems ; Similarity measures
Online access:	Full text

container_end_page	30
container_issue	3
container_start_page	1
container_title	ACM transactions on storage
container_volume	19
creator	Xia, Wen ; Pu, Lifeng ; Zou, Xiangyu ; Shilane, Philip ; Li, Shiyi ; Zhang, Haijun ; Wang, Xuan
description	Post-deduplication delta compression is a data reduction technique that calculates and stores the differences of the very similar but non-duplicate chunks in storage systems, which is able to achieve a very high compression ratio. However, the low throughput of widely-used resemblance detection approaches (e.g., N-Transform) usually becomes the bottleneck of delta compression systems due to introducing high computational overhead. Generally, this overhead mainly consists of two parts: ① calculating the rolling hash byte-by-byte across data chunks and ② applying multiple transforms on all the calculated rolling hash values. In this paper, we propose Odess, a fast and lightweight resemblance detection approach, that greatly reduces the computational overhead for resemblance detection while achieving high detection accuracy and high compression ratio. Specifically, Odess first utilizes a novel Subwindow-based Parallel Rolling hash method (called SWPR) using SIMD (i.e., Single Instruction Multiple Data [1]) to accelerate calculation of rolling hashes (corresponding to the first part of the overhead). Moreover, Odess uses a novel Content-Defined Sampling method to generate a much smaller proxy hash set from the whole rolling hash set, and then quickly applies transforms on this small hash set for resemblance detection (corresponding to the second part of the overhead). Evaluation results show that during the stage of resemblance detection, the Odess approach is ∼ 31.4 × and ∼ 7.9 × faster than the state-of-the-art N-Transform and Finesse (i.e., a recent variant of N-Transform [39]), respectively. When considering an end-to-end data reduction storage system, Odess-based system’s throughput is about 3.20 × and 1.41 × higher than N-Transform and Finesse-based systems’ throughput, respectively, while maintaining the high compression ratio of N-Transform and achieving ∼ 1.22 × higher compression ratio over Finesse.
doi_str_mv	10.1145/3584663
format	Article
fullrecord	<record><control><sourceid>acm_cross</sourceid><recordid>TN_cdi_crossref_primary_10_1145_3584663</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3584663</sourcerecordid><originalsourceid>FETCH-LOGICAL-a277t-3ef854cd20247cc88601ba7096aedd150dab460af6023ae2ac1f49f27c8339e53</originalsourceid><addsrcrecordid>eNo90EtLAzEQB_AgCtYq3j3l5mk17-wepQ8VCorU8zLNTtroPsomIn57u7b2MjPM_JjDn5Brzu44V_pe6lwZI0_IiGstM8kKeXqcrT0nFzF-MCaNUHpEPpcbpFOMYd3SztM5xEShregirDfpG4dK3zBis6qhdQNN6FLoWuq7ns68Dy5gm-hrF1M2xeprWwcHf2CKdQI66ZptjzHuNpfkzEMd8erQx-R9PltOnrLFy-Pz5GGRgbA2ZRJ9rpWrBBPKOpfnhvEVWFYYwKrimlWwUoaBN0xIQAGOe1V4YV0uZYFajsnt_q_ruxh79OW2Dw30PyVn5ZBRechoJ2_2ElxzRP_HX3OFYhY</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>The Design of Fast and Lightweight Resemblance Detection for Efficient Post-Deduplication Delta Compression</title><source>ACM Digital Library Complete</source><creator>Xia, Wen ; Pu, Lifeng ; Zou, Xiangyu ; Shilane, Philip ; Li, Shiyi ; Zhang, Haijun ; Wang, Xuan</creator><creatorcontrib>Xia, Wen ; Pu, Lifeng ; Zou, Xiangyu ; Shilane, Philip ; Li, Shiyi ; Zhang, Haijun ; Wang, Xuan</creatorcontrib><description>Post-deduplication delta compression is a data reduction technique that calculates and stores the differences of the very similar but non-duplicate chunks in storage systems, which is able to achieve a very high compression ratio. However, the low throughput of widely-used resemblance detection approaches (e.g., N-Transform) usually becomes the bottleneck of delta compression systems due to introducing high computational overhead. Generally, this overhead mainly consists of two parts: ① calculating the rolling hash byte-by-byte across data chunks and ② applying multiple transforms on all the calculated rolling hash values. 
In this paper, we propose Odess, a fast and lightweight resemblance detection approach, that greatly reduces the computational overhead for resemblance detection while achieving high detection accuracy and high compression ratio. Specifically, Odess first utilizes a novel Subwindow-based Parallel Rolling hash method (called SWPR) using SIMD (i.e., Single Instruction Multiple Data [1]) to accelerate calculation of rolling hashes (corresponding to the first part of the overhead). Moreover, Odess uses a novel Content-Defined Sampling method to generate a much smaller proxy hash set from the whole rolling hash set, and then quickly applies transforms on this small hash set for resemblance detection (corresponding to the second part of the overhead). Evaluation results show that during the stage of resemblance detection, the Odess approach is ∼ 31.4 × and ∼ 7.9 × faster than the state-of-the-art N-Transform and Finesse (i.e., a recent variant of N-Transform [39]), respectively. When considering an end-to-end data reduction storage system, Odess-based system’s throughput is about 3.20 × and 1.41 × higher than N-Transform and Finesse-based systems’ throughput, respectively, while maintaining the high compression ratio of N-Transform and achieving ∼ 1.22 × higher compression ratio over Finesse.</description><identifier>ISSN: 1553-3077</identifier><identifier>EISSN: 1553-3093</identifier><identifier>DOI: 10.1145/3584663</identifier><language>eng</language><publisher>New York, NY: ACM</publisher><subject>Deduplication ; Extraction, transformation and loading ; Information systems ; Similarity measures</subject><ispartof>ACM transactions on storage, 2023-08, Vol.19 (3), p.1-30</ispartof><rights>Copyright held by the owner/author(s). 
Publication rights licensed to ACM.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-a277t-3ef854cd20247cc88601ba7096aedd150dab460af6023ae2ac1f49f27c8339e53</citedby><cites>FETCH-LOGICAL-a277t-3ef854cd20247cc88601ba7096aedd150dab460af6023ae2ac1f49f27c8339e53</cites><orcidid>0000-0003-4093-6391 ; 0000-0002-1648-0227 ; 0000-0001-5104-8301 ; 0000-0002-9117-2590 ; 0000-0002-3512-0649 ; 0000-0003-1235-0502 ; 0000-0001-8206-6916</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://dl.acm.org/doi/pdf/10.1145/3584663$$EPDF$$P50$$Gacm$$Hfree_for_read</linktopdf><link.rule.ids>314,776,780,2276,27901,27902,40172,76197</link.rule.ids></links><search><creatorcontrib>Xia, Wen</creatorcontrib><creatorcontrib>Pu, Lifeng</creatorcontrib><creatorcontrib>Zou, Xiangyu</creatorcontrib><creatorcontrib>Shilane, Philip</creatorcontrib><creatorcontrib>Li, Shiyi</creatorcontrib><creatorcontrib>Zhang, Haijun</creatorcontrib><creatorcontrib>Wang, Xuan</creatorcontrib><title>The Design of Fast and Lightweight Resemblance Detection for Efficient Post-Deduplication Delta Compression</title><title>ACM transactions on storage</title><addtitle>ACM TOS</addtitle><description>Post-deduplication delta compression is a data reduction technique that calculates and stores the differences of the very similar but non-duplicate chunks in storage systems, which is able to achieve a very high compression ratio. However, the low throughput of widely-used resemblance detection approaches (e.g., N-Transform) usually becomes the bottleneck of delta compression systems due to introducing high computational overhead. 
Generally, this overhead mainly consists of two parts: ① calculating the rolling hash byte-by-byte across data chunks and ② applying multiple transforms on all the calculated rolling hash values. In this paper, we propose Odess, a fast and lightweight resemblance detection approach, that greatly reduces the computational overhead for resemblance detection while achieving high detection accuracy and high compression ratio. Specifically, Odess first utilizes a novel Subwindow-based Parallel Rolling hash method (called SWPR) using SIMD (i.e., Single Instruction Multiple Data [1]) to accelerate calculation of rolling hashes (corresponding to the first part of the overhead). Moreover, Odess uses a novel Content-Defined Sampling method to generate a much smaller proxy hash set from the whole rolling hash set, and then quickly applies transforms on this small hash set for resemblance detection (corresponding to the second part of the overhead). Evaluation results show that during the stage of resemblance detection, the Odess approach is ∼ 31.4 × and ∼ 7.9 × faster than the state-of-the-art N-Transform and Finesse (i.e., a recent variant of N-Transform [39]), respectively. 
When considering an end-to-end data reduction storage system, Odess-based system’s throughput is about 3.20 × and 1.41 × higher than N-Transform and Finesse-based systems’ throughput, respectively, while maintaining the high compression ratio of N-Transform and achieving ∼ 1.22 × higher compression ratio over Finesse.</description><subject>Deduplication</subject><subject>Extraction, transformation and loading</subject><subject>Information systems</subject><subject>Similarity measures</subject><issn>1553-3077</issn><issn>1553-3093</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><recordid>eNo90EtLAzEQB_AgCtYq3j3l5mk17-wepQ8VCorU8zLNTtroPsomIn57u7b2MjPM_JjDn5Brzu44V_pe6lwZI0_IiGstM8kKeXqcrT0nFzF-MCaNUHpEPpcbpFOMYd3SztM5xEShregirDfpG4dK3zBis6qhdQNN6FLoWuq7ns68Dy5gm-hrF1M2xeprWwcHf2CKdQI66ZptjzHuNpfkzEMd8erQx-R9PltOnrLFy-Pz5GGRgbA2ZRJ9rpWrBBPKOpfnhvEVWFYYwKrimlWwUoaBN0xIQAGOe1V4YV0uZYFajsnt_q_ruxh79OW2Dw30PyVn5ZBRechoJ2_2ElxzRP_HX3OFYhY</recordid><startdate>20230831</startdate><enddate>20230831</enddate><creator>Xia, Wen</creator><creator>Pu, Lifeng</creator><creator>Zou, Xiangyu</creator><creator>Shilane, Philip</creator><creator>Li, Shiyi</creator><creator>Zhang, Haijun</creator><creator>Wang, Xuan</creator><general>ACM</general><scope>AAYXX</scope><scope>CITATION</scope><orcidid>https://orcid.org/0000-0003-4093-6391</orcidid><orcidid>https://orcid.org/0000-0002-1648-0227</orcidid><orcidid>https://orcid.org/0000-0001-5104-8301</orcidid><orcidid>https://orcid.org/0000-0002-9117-2590</orcidid><orcidid>https://orcid.org/0000-0002-3512-0649</orcidid><orcidid>https://orcid.org/0000-0003-1235-0502</orcidid><orcidid>https://orcid.org/0000-0001-8206-6916</orcidid></search><sort><creationdate>20230831</creationdate><title>The Design of Fast and Lightweight Resemblance Detection for Efficient Post-Deduplication Delta Compression</title><author>Xia, Wen ; Pu, Lifeng ; Zou, Xiangyu ; Shilane, Philip ; Li, 
Shiyi ; Zhang, Haijun ; Wang, Xuan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a277t-3ef854cd20247cc88601ba7096aedd150dab460af6023ae2ac1f49f27c8339e53</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Deduplication</topic><topic>Extraction, transformation and loading</topic><topic>Information systems</topic><topic>Similarity measures</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Xia, Wen</creatorcontrib><creatorcontrib>Pu, Lifeng</creatorcontrib><creatorcontrib>Zou, Xiangyu</creatorcontrib><creatorcontrib>Shilane, Philip</creatorcontrib><creatorcontrib>Li, Shiyi</creatorcontrib><creatorcontrib>Zhang, Haijun</creatorcontrib><creatorcontrib>Wang, Xuan</creatorcontrib><collection>CrossRef</collection><jtitle>ACM transactions on storage</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Xia, Wen</au><au>Pu, Lifeng</au><au>Zou, Xiangyu</au><au>Shilane, Philip</au><au>Li, Shiyi</au><au>Zhang, Haijun</au><au>Wang, Xuan</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>The Design of Fast and Lightweight Resemblance Detection for Efficient Post-Deduplication Delta Compression</atitle><jtitle>ACM transactions on storage</jtitle><stitle>ACM TOS</stitle><date>2023-08-31</date><risdate>2023</risdate><volume>19</volume><issue>3</issue><spage>1</spage><epage>30</epage><pages>1-30</pages><issn>1553-3077</issn><eissn>1553-3093</eissn><abstract>Post-deduplication delta compression is a data reduction technique that calculates and stores the differences of the very similar but non-duplicate chunks in storage systems, which is able to achieve a very high compression ratio. 
However, the low throughput of widely-used resemblance detection approaches (e.g., N-Transform) usually becomes the bottleneck of delta compression systems due to introducing high computational overhead. Generally, this overhead mainly consists of two parts: ① calculating the rolling hash byte-by-byte across data chunks and ② applying multiple transforms on all the calculated rolling hash values. In this paper, we propose Odess, a fast and lightweight resemblance detection approach, that greatly reduces the computational overhead for resemblance detection while achieving high detection accuracy and high compression ratio. Specifically, Odess first utilizes a novel Subwindow-based Parallel Rolling hash method (called SWPR) using SIMD (i.e., Single Instruction Multiple Data [1]) to accelerate calculation of rolling hashes (corresponding to the first part of the overhead). Moreover, Odess uses a novel Content-Defined Sampling method to generate a much smaller proxy hash set from the whole rolling hash set, and then quickly applies transforms on this small hash set for resemblance detection (corresponding to the second part of the overhead). Evaluation results show that during the stage of resemblance detection, the Odess approach is ∼ 31.4 × and ∼ 7.9 × faster than the state-of-the-art N-Transform and Finesse (i.e., a recent variant of N-Transform [39]), respectively. 
When considering an end-to-end data reduction storage system, Odess-based system’s throughput is about 3.20 × and 1.41 × higher than N-Transform and Finesse-based systems’ throughput, respectively, while maintaining the high compression ratio of N-Transform and achieving ∼ 1.22 × higher compression ratio over Finesse.</abstract><cop>New York, NY</cop><pub>ACM</pub><doi>10.1145/3584663</doi><tpages>30</tpages><orcidid>https://orcid.org/0000-0003-4093-6391</orcidid><orcidid>https://orcid.org/0000-0002-1648-0227</orcidid><orcidid>https://orcid.org/0000-0001-5104-8301</orcidid><orcidid>https://orcid.org/0000-0002-9117-2590</orcidid><orcidid>https://orcid.org/0000-0002-3512-0649</orcidid><orcidid>https://orcid.org/0000-0003-1235-0502</orcidid><orcidid>https://orcid.org/0000-0001-8206-6916</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 1553-3077
ispartof	ACM transactions on storage, 2023-08, Vol.19 (3), p.1-30
issn	1553-3077 1553-3093
language	eng
recordid	cdi_crossref_primary_10_1145_3584663
source	ACM Digital Library Complete
subjects	Deduplication ; Extraction, transformation and loading ; Information systems ; Similarity measures
title	The Design of Fast and Lightweight Resemblance Detection for Efficient Post-Deduplication Delta Compression
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-21T17%3A25%3A14IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-acm_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=The%20Design%20of%20Fast%20and%20Lightweight%20Resemblance%20Detection%20for%20Efficient%20Post-Deduplication%20Delta%20Compression&rft.jtitle=ACM%20transactions%20on%20storage&rft.au=Xia,%20Wen&rft.date=2023-08-31&rft.volume=19&rft.issue=3&rft.spage=1&rft.epage=30&rft.pages=1-30&rft.issn=1553-3077&rft.eissn=1553-3093&rft_id=info:doi/10.1145/3584663&rft_dat=%3Cacm_cross%3E3584663%3C/acm_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true
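
The two cost components named in the description field (a byte-by-byte rolling hash, then transforms applied to the hash values) and the idea of transforming only a content-defined sample can be illustrated with a minimal Python sketch. This is not the authors' implementation: the Gear-style hash, the sampling mask (roughly 1 in 128 hash values), the 12 linear transforms, and every function name below are assumptions chosen for illustration, and the SIMD-parallel SWPR step is not modeled at all.

```python
import random

MASK64 = (1 << 64) - 1

# Assumption: a Gear-style rolling hash with a fixed pseudo-random byte table.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]

# Assumption: 12 linear transforms (m * h + a) mod 2^64, one per feature.
N_FEATURES = 12
COEFFS = [(_rng.getrandbits(64) | 1, _rng.getrandbits(64))
          for _ in range(N_FEATURES)]

# Assumption: keep a hash only if these 7 bits are zero (about 1 in 128).
SAMPLE_MASK = 0x7F << 40


def rolling_hashes(chunk: bytes):
    """Yield the rolling hash value after each byte of the chunk."""
    h = 0
    for b in chunk:
        h = ((h << 1) + GEAR[b]) & MASK64
        yield h


def sketch_features(chunk: bytes) -> list:
    """Compute N_FEATURES max-of-transform features from sampled hashes only."""
    feats = [0] * N_FEATURES
    for h in rolling_hashes(chunk):
        if h & SAMPLE_MASK:
            continue  # content-defined sampling: skip most hash values
        for i, (m, a) in enumerate(COEFFS):
            t = (m * h + a) & MASK64
            if t > feats[i]:
                feats[i] = t
    return feats


def similarity(f1: list, f2: list) -> float:
    """Fraction of feature positions on which two chunks agree."""
    return sum(x == y for x, y in zip(f1, f2)) / len(f1)


if __name__ == "__main__":
    base = bytes(random.Random(7).getrandbits(8) for _ in range(8192))
    edited = base[:4000] + b"X" + base[4001:]
    print(similarity(sketch_features(base), sketch_features(edited)))
```

Because the sampling decision depends only on the hash value, and therefore only on the bytes inside the hash window, regions that two chunks share select the same hash values for transformation. In this sketch the inner transform loop runs for about 1 in 128 positions instead of all of them, which is the kind of saving the description attributes to the second overhead component.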